Abstract

A detailed presentation of hypothesis testing is given. The "look elsewhere" effect
is illustrated, and a treatment of the trials factor is proposed with the introduction
of hypothesis hypertests. An example of such a hypertest is presented, named
BumpHunter, which is used in ATLAS [1], and in an earlier version also in
CDF [2], to search for exotic phenomena in high energy physics. As a demonstration,
the BumpHunter is used to address Problem 1 of the Banff Challenge [3].
1 Introduction

The goal of the BumpHunter is to point out the presence of a local data excess like
those caused by resonant production of massive particles in Particle Physics [1]. Such
features are colloquially called "bumps", hence the name "BumpHunter". More
specifically, the BumpHunter is a test that locates the most significant bump, where
the data are most deviant from the Null hypothesis. Based on this bump, the test returns
a p-value, corresponding to its Type-I error probability. This is done in a way that
accounts for the "trials factor".

For the reader who may not be familiar with the terminology of hypothesis tests,
a thorough discussion follows. Another account of hypothesis testing can be found
in [4]. A similar discussion of the trials factor can be found in [5].

In the following paragraphs we spell out issues that are often misunderstood, such 
as the interpretation of p-values and the issue of "trials factor". A solution is provided 
to account for the latter, by introducing the notion of hypothesis hypertest. The dis- 
cussion that follows is not limited only to the BumpHunter; the latter is a practical 
application. 

After presenting the BumpHunter algorithm, a demonstration is made, based on
Problem 1 of the Banff Challenge [3].

1.1 Hypothesis tests and p-values 

There are several statistical tests to evaluate if some data are consistent with a specific
hypothesis. Two famous examples are Pearson's $\chi^2$ and the Kolmogorov-Smirnov
(KS) test. The BumpHunter is one more test in this category.

In all tests of this kind, often called "hypothesis tests" or "goodness of fit tests",
one has some data D and a hypothesis, which typically is the "Null", or "0-signal", or
"background" hypothesis, denoted $H_0$. One could test the consistency of D with any
hypothesis, but $H_0$ is usually chosen, because typically a discovery can be claimed by
establishing that the data are inconsistent with the "standard" theory, without necessarily
having to show that they are consistent with some alternative theory. Once inconsistency
with $H_0$ is established, several alternative signal hypotheses can be tested to
characterize the discovery. For example, we can assume that the signal follows a specific
distribution, and estimate its amount, either by Bayesian inference, or by defining
frequentist confidence intervals (CIs). It helps, conceptually, to distinguish hypothesis
tests, like $\chi^2$ or the BumpHunter, from Bayesian inference and frequentist CI-setting
methods¹. One can use as observable the value of $\chi^2$ or of the BumpHunter statistic
(defined below) to make a Bayesian inference or to set a frequentist CI on the amount
of a specific signal that may exist in the data, but the BumpHunter is designed to
address a different question, for which only D and $H_0$ are required, and no specific
signal is assumed, hence its model-independence.

All hypothesis tests, including the BumpHunter, work as follows:

1. D is compared to $H_0$, and their difference is quantified by a single number. This
number is called "the statistic" of the test, or "test statistic", and in this document
it is denoted by t. For example, in the $\chi^2$ test, the statistic is

$$t = \sum_i \frac{(d_i - b_i)^2}{b_i}, \tag{1}$$

where $d_i$ denotes the observed events in bin i, and $b_i$ the events expected by $H_0$
in the same bin. The statistic in the KS test is the biggest difference between
the cumulative distribution of the data and the cumulative distribution expected
by $H_0$. We will present later the exact definition of the BumpHunter statistic,
but it follows the same logic: the bigger the difference between data and $H_0$, the
bigger the test statistic.

2. Pseudo-data are generated, following the expectation of $H_0$. In each pseudo-data
spectrum, the same test statistic is computed, comparing the pseudo-data to $H_0$.
The distribution of test statistics from pseudo-experiments is made. The achievement
of Pearson, Kolmogorov and Smirnov was that they calculated analytically
the distribution of the statistic of their tests under $H_0$. For example, Pearson
showed that, under some assumptions of gaussianity, his $\chi^2$ statistic follows a
$\chi^2$-distribution. Nowadays, computers make it possible to estimate numerically
the distribution of any test statistic.

3. Calculate the p-value of the test. The p-value is the probability that, when $H_0$
is assumed, the test statistic will be equal to, or greater than², the test statistic
obtained by comparing the actual data (D) to $H_0$:

$$\text{p-value} = P(t \geq t_o \mid H_0), \tag{2}$$

where the test statistic t is a random variable, since it depends on how pseudo-data
fluctuate around $H_0$, and $t_o$ is the observed statistic from comparing D to $H_0$.
If the exact probability density function (PDF) of t under $H_0$ is known ($\rho(t \mid H_0)$),
then the p-value is exactly computed as $\int_{t_o}^{\infty} \rho(t \mid H_0)\,dt$. When
$\rho(t \mid H_0)$ is estimated using pseudo-experiments, as is the case for the BumpHunter,
then the p-value is estimated as a binomial success probability. Using Bayes' theorem, if
N pseudo-experiments are produced, of which S had $t \geq t_o$, we infer

$$p(\text{p-value} \mid N, S) = \frac{1}{\mathcal{N}} \binom{N}{S}\, \text{p-value}^{S}\, (1 - \text{p-value})^{N-S}\, \pi(\text{p-value}), \tag{3}$$

where $\mathcal{N}$ is a normalization constant, and $\pi(\text{p-value})$ is the prior assumed. If
we assume $\pi(\text{p-value}) = 1$, which is a reasonable choice, the result becomes

$$p(\text{p-value} \mid N, S) = \binom{N}{S}\, \text{p-value}^{S}\, (1 - \text{p-value})^{N-S}\, (1 + N). \tag{4}$$

According to this posterior distribution, the most likely p-value is $S/N$.

¹There is, actually, a connection between hypothesis tests and frequentist CIs, which will be explained
in this footnote, hoping to avoid confusion. One can assume any kind of signal, and set a lower limit to
the amount of this signal that may exist in D, using the classical Neyman construction, where the statistic
of some test is used as observable; to be specific, let's say $\chi^2$ is used to construct the Neyman band. If the
resulting CI doesn't contain the value 0 for signal, then $H_0$ is excluded, in the frequentist sense, namely in
the sense that 0-signal is not contained in a CI characterized by some Confidence Level (CL). The smallest
CL for which the corresponding semi-infinite CI includes the value 0 for signal is equal to (1 - p-value) of
the hypothesis test (of the $\chi^2$ test in this case) which compares D to $H_0$. This is the case for any assumed
signal shape.

²The convention used is that the test statistic becomes greater as the discrepancy increases; otherwise the
p-value would be defined as $P(t \leq t_o \mid H_0)$.

So, the final product of a hypothesis test of this kind is a p-value. Ideally, the
p-value would be precisely computed, but in practice it has to be estimated from a finite
set of pseudo-experiments. We will explain next how the p-value can be interpreted,
and why it is so useful.
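
The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in the references: the bin counts are assumed to fluctuate as independent Poisson variables around the $H_0$ expectation, the statistic is the $\chi^2$ sum, and all function names (`poisson`, `chi2_statistic`, `estimate_p_value`) are hypothetical.

```python
import math
import random


def poisson(mean, rng):
    """Draw one Poisson variate (Knuth's method; adequate for small means)."""
    limit, k, prod = math.exp(-mean), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def chi2_statistic(data, background):
    """Step 1: sum over bins of (d_i - b_i)^2 / b_i."""
    return sum((d - b) ** 2 / b for d, b in zip(data, background))


def estimate_p_value(data, background, n_pseudo=2000, seed=1):
    """Steps 2 and 3: return (S/N, S), where S counts pseudo-experiments with t >= t_o."""
    rng = random.Random(seed)
    t_obs = chi2_statistic(data, background)
    successes = 0
    for _ in range(n_pseudo):
        # Assumption: each bin fluctuates as an independent Poisson count under H0.
        pseudo = [poisson(b, rng) for b in background]
        if chi2_statistic(pseudo, background) >= t_obs:
            successes += 1
    return successes / n_pseudo, successes


background = [5.0] * 10
compatible = [5, 4, 6, 5, 7, 3, 5, 6, 4, 5]
discrepant = [5, 4, 6, 5, 20, 3, 5, 6, 4, 5]
p_ok, _ = estimate_p_value(compatible, background)
p_bad, _ = estimate_p_value(discrepant, background)
print(p_ok, p_bad)
```

The returned ratio S/N is the most likely p-value under a flat prior; the count S is returned as well so that the full posterior can be evaluated if an uncertainty on the estimate is needed.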



1.2 What does the p-value mean? 

It will be shown that the p-value is interpretable as a false-discovery probability. To
arrive at that interpretation systematically, and to clarify what it means, we will first
prove a simple theorem.



1.2.1 A simple theorem about p-values 

Assume a decision algorithm which declares discovery (i.e. it rules out $H_0$) if
p-value $\leq \alpha$, where $\alpha \in [0, 1]$ is an arbitrary parameter of the algorithm. It will be
shown that the probability of this algorithm to wrongly rule out $H_0$ is $\alpha$, no matter what
hypothesis test the p-value is coming from, under one condition: that there be a solution $\xi$
for which $\int_{\xi}^{\infty} \rho_t(x)\,dx = \alpha$, where $\rho_t$ is the PDF followed by the test statistic (t)
under $H_0$.

The probability to wrongly rule out $H_0$, which is named "Type-I error", is the probability
to find p-value $\leq \alpha$ while $H_0$ holds, namely

$$P(\text{Type-I}) = P(\text{p-value} \leq \alpha \mid H_0), \tag{5}$$

which can be spelled out more clearly, using the definition of p-value:

$$P(\text{Type-I}) = P\big(P(t \geq t_o \mid H_0) \leq \alpha \,\big|\, H_0\big). \tag{6}$$
In eq. 2, t was a random variable and $t_o$ was a fixed number, which depended on the real
data D and on $H_0$. In eq. 6 both t and $t_o$ are random variables, because we don't have a
fixed observed dataset D; we are instead trying to calculate the probability that D will
be such that $t_o$ will satisfy $P(t \geq t_o \mid H_0) \leq \alpha$. In other words, eq. 6 is the probability
of drawing a random variable $t_o$, such that the random variable t will have probability
no greater than $\alpha$ to be greater than or equal to $t_o$. That happens if $t_o \geq \xi$, where
$\int_{\xi}^{\infty} \rho_t(x)\,dx = \alpha$, and $\rho_t$ is the PDF followed by t.³ The probability for $t_o$ to be
greater than $\xi$ is $\int_{\xi}^{\infty} \rho_{t_o}(x)\,dx$, where $\rho_{t_o}$ is the PDF followed by $t_o$. So, eq. 6 can
be written

$$P(\text{Type-I}) = \int_{\xi}^{\infty} \rho_{t_o}(x)\,dx. \tag{7}$$

By looking back at eq. 6 we see that t fluctuates according to how pseudo-data fluctuate
around $H_0$, as implied by the conditional in $P(t \geq t_o \mid H_0)$. At the same time, $t_o$
fluctuates according to how pseudo-data fluctuate around $H_0$, as implied by the rightmost
conditional in eq. 6. So, both t and $t_o$ are drawn from the same distribution,
namely $\rho_{t_o}(x) = \rho_t(x)$. Therefore, eq. 7 becomes

$$P(\text{Type-I}) = \int_{\xi}^{\infty} \rho_t(x)\,dx = \alpha. \tag{8}$$
This is an important result, and it is what makes p-values useful. We showed that,
no matter how we define the test statistic t, if we use the resulting p-value in a discovery
algorithm that declares discovery when p-value $\leq \alpha$, the Type-I error probability of
that algorithm will be equal to $\alpha$. The only requirement is for a $\xi$ to exist that satisfies
$\int_{\xi}^{\infty} \rho_t(x)\,dx = \alpha$.

Corollary: If a test statistic t follows a continuous PDF $\rho_t$ under $H_0$, then the condition
of the above theorem is satisfied for any value $\alpha \in [0, 1]$, therefore
$P(\text{p-value} \leq \alpha \mid H_0) = \alpha \ \forall \alpha$, therefore the p-value of any such hypothesis test is a
random variable that follows a uniform distribution between 0 and 1, when $H_0$ is true.
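
A quick numerical check of the theorem and its corollary may help. The sketch below is an illustration only: it picks two arbitrary continuous test statistics whose p-values are known in closed form (an exponential variable, and the absolute value of a standard Gaussian), draws $t_o$ under $H_0$, and measures how often p-value $\leq \alpha$. The choice of statistics and the helper name `type1_rate` are assumptions made for this example.

```python
import math
import random


def type1_rate(draw_statistic, p_value, alpha, n_trials=100_000, seed=7):
    """Fraction of H0 pseudo-experiments in which p-value <= alpha."""
    rng = random.Random(seed)
    hits = sum(p_value(draw_statistic(rng)) <= alpha for _ in range(n_trials))
    return hits / n_trials


# Each entry: (how t_o is drawn under H0, exact p-value P(t >= t_o | H0)).
tests = {
    "exponential": (lambda rng: rng.expovariate(1.0),
                    lambda t: math.exp(-t)),
    "abs_gaussian": (lambda rng: abs(rng.gauss(0.0, 1.0)),
                     lambda t: math.erfc(t / math.sqrt(2.0))),
}

rates = {
    (name, alpha): type1_rate(draw, pval, alpha)
    for name, (draw, pval) in tests.items()
    for alpha in (0.01, 0.05, 0.3)
}
for key, rate in rates.items():
    print(key, rate)
```

For both statistics the measured rate should agree with $\alpha$ within statistical fluctuations, even though the two statistics are defined very differently.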

Note that if $\rho_t$ is discontinuous, then this corollary does not follow, i.e. the p-value
does not follow a uniform distribution between 0 and 1, but the previous theorem is still
valid for $\alpha$ values for which $\int_{\xi}^{\infty} \rho_t(x)\,dx = \alpha$ has a solution. This is important, because
it is often wrongly thought that if a p-value doesn't follow a uniform distribution under
$H_0$, then it can not be correctly interpreted as a Type-I error probability. That is not
true. In paragraph 1.2.2 we will see why.

³The equation $\int_{\xi}^{\infty} \rho_t(x)\,dx = \alpha$ needs to have a solution; if $\xi$ doesn't exist, the rest of the proof fails. For
example, consider $\rho_t(x) = \frac{1}{2} + \frac{1}{2}\delta(x - 0.5)$, where $x \in [0, 1]$ and $\delta(\,)$ is the Dirac $\delta$ function. In this case,
there is no $\xi$ that satisfies $\int_{\xi}^{\infty} \rho_t(x)\,dx = \alpha$ for $\alpha \in [0.25, 0.75)$, because if $\xi > 0.5$ then $\int_{\xi}^{\infty} \rho_t(x)\,dx < 0.25$,
and if $\xi < 0.5$ then $\int_{\xi}^{\infty} \rho_t(x)\,dx > 0.75$. If $\rho_t(x)$ is continuous, then a $\xi$ exists $\forall \alpha$. Most test statistics based on
event counts don't follow a continuous PDF, due to event counts being discrete. Another possible reason to
not follow a continuous PDF is the imposition of conditions, as we will see in paragraph 2. So, there are specific
values of $\alpha$ for which this theorem is exactly true; in other cases the probability to wrongly exclude $H_0$ is
not exactly $\alpha$. However, we will explain in paragraph 1.2.2 that this theorem's condition is met if we set $\alpha$
equal to an observed p-value, which allows any observed p-value to be exactly interpreted as a Type-I error
probability.





1.2.2 Interpretation of the p-value of a test

So, how should we interpret the p-value of a hypothesis test checking the consistency
of a dataset D with a hypothesis $H_0$?

Exploiting the theorem of paragraph 1.2.1, if we observe p-value = $\gamma$ we know that
there is a discovery algorithm which would have ruled out $H_0$ based on this p-value
with Type-I error probability equal to $\gamma$. That algorithm is the one with parameter
$\alpha = \gamma$. If we set $\alpha < \gamma$ then the algorithm wouldn't declare discovery for the observed
p-value. If we set $\alpha > \gamma$ a discovery would still be declared, but such an algorithm
would have a larger Type-I error probability, so it would be less reliable. Therefore,
if we observe p-value = $\gamma$, then the discovery algorithm with the smallest Type-I error
probability that would still declare discovery would do so with probability $\gamma$ of being
wrong. In this sense, the observed p-value is a false-discovery probability. It is the
smallest false-discovery probability we can have, if we declare $H_0$ to be false.

What if the test statistic t follows a discontinuous PDF $\rho(t)$? We saw in 1.2.1
that in that case there can be some values of $\alpha$ for which the proof can not proceed,
because there is no $\xi$ satisfying $\int_{\xi}^{\infty} \rho(t)\,dt = \alpha$. That, however, does not interfere
with the interpretation of an observed p-value = $\gamma$ as a Type-I error probability. The
reason is that $\gamma$ will always be such that $\int_{\xi}^{\infty} \rho(t)\,dt = \gamma$ will have a solution, so the
theorem of paragraph 1.2.1 will always hold, if we set $\alpha = \gamma$. How do we know that
any observed p-value = $\gamma$ will always be such that $\int_{\xi}^{\infty} \rho(t)\,dt = \gamma$ will have a solution?
We know, because otherwise $\gamma$ couldn't have been observed. Let's take, for example,
the discontinuous PDF that was mentioned in paragraph 1.2.1, $\rho(t) = \frac{1}{2} + \frac{1}{2}\delta(t - 0.5)$,
$t \in [0, 1]$. For this $\rho(t)$, as mentioned earlier, the equation $\int_{\xi}^{\infty} \rho(t)\,dt = \alpha$ has
no solution for $\alpha \in [0.25, 0.75)$, but this is precisely the range where $\gamma$ couldn't be
in any circumstance. If $t_o > 0.5$, then the p-value will be $< 0.25$. If $t_o \leq 0.5$, then
p-value $\geq 0.75$.
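
The example PDF can be simulated directly. The following sketch is an illustration under the stated PDF, with hypothetical helper names: with probability 1/2 the statistic equals exactly 0.5 (the $\delta$ peak), otherwise it is uniform in [0, 1]. It checks that no observed p-value lands in [0.25, 0.75), and that the Type-I error rate equals $\gamma$ for values of $\gamma$ that can actually be observed.

```python
import random


def draw_statistic(rng):
    """Draw t under H0: half of the probability sits exactly at t = 0.5."""
    return 0.5 if rng.random() < 0.5 else rng.random()


def p_value(t_obs):
    """P(t >= t_obs | H0): continuous part (1 - t_obs)/2, plus the peak if t_obs <= 0.5."""
    return (1.0 - t_obs) / 2.0 + (0.5 if t_obs <= 0.5 else 0.0)


rng = random.Random(3)
observed = [p_value(draw_statistic(rng)) for _ in range(200_000)]

# Number of observed p-values inside the "forbidden" range [0.25, 0.75).
in_gap = sum(0.25 <= p < 0.75 for p in observed)


def type1_rate(gamma):
    """Fraction of H0 pseudo-experiments with p-value <= gamma."""
    return sum(p <= gamma for p in observed) / len(observed)


print(in_gap, type1_rate(0.1), type1_rate(0.75), type1_rate(0.5))
```

The last number illustrates the failure mode: for $\alpha = 0.5$, which lies in the gap, the measured rate is about 0.25 rather than 0.5, whereas for the observable values 0.1 and 0.75 the rate matches.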

We showed, therefore, that any observed p-value will always be interpretable,
thanks to the theorem of paragraph 1.2.1, as the smallest possible Type-I error probability
of a discovery algorithm which would have declared discovery on the basis of the
observed p-value. This interpretation will be correct even if the conditions are not met
for the corollary of 1.2.1 to be true, i.e. even if the p-value is not distributed uniformly
in [0, 1] under $H_0$.

To prevent a common misinterpretation: if we find a p-value = 0.7, it doesn't mean
that $H_0$ is right with probability 70%. In strictly frequentist terms, the p-value is not
a statement about $H_0$ itself, but about the Type-I error probability of an algorithm that
would exclude $H_0$, as explained above⁴.

⁴Equivalently, the p-value corresponds to the CL of a specific CI. See footnote 1.

1.3 Interpretation of multiple tests

If we run the KS test and find p-value = 0.7, we know that even the most reliable decision
which would rule out $H_0$ on the grounds of the KS test would still have 70%
probability to be wrong. With such high odds of being wrong, we couldn't support a discovery
claim. But the fact that KS doesn't identify a big discrepancy doesn't mean no other
test will. For example, the data D may follow the PDF predicted by $H_0$, but have different
population. Since the KS test compares cumulative distributions, it is insensitive to
an overall normalization difference, while the $\chi^2$ test would notice it. So, if the $\chi^2$ test
returns p-value = 10^^, we can say that the most reliable decision which would rule
out $H_0$ on the grounds of the $\chi^2$ test would have probability 10^^ to be wrong. With
such high confidence, a discovery claim could be supported. This statement from $\chi^2$
does not contradict the one from KS. Both are correct, simultaneously. One says that
the D distribution shape agrees with $H_0$; the other says that the normalization doesn't.

The above scenario illustrates why one can benefit from more than one statistical
test. Each test is sensitive to different features, and we may not know a-priori how D
may differ from $H_0$. Unless one is willing to limit the scope of the search to only one
kind of discrepancy (e.g. shape discrepancy or normalization discrepancy), one needs to
compare D to $H_0$ in more than one way. To do so correctly, one must carefully take into
account the "trials factor", which is the subject of the next paragraph.

1.3.1 Ad-hoc tests, and trials factor 

Reading paragraph 1.3, one may be tempted to "engineer" more hypothesis tests, until
one of them gives a small p-value that would allow ruling out $H_0$ with great
confidence. For example, imagine that the data D are binned in $10^4$ small bins. In so
many bins, it is only natural for one bin to fluctuate significantly from the $H_0$ prediction,
even if $H_0$ is true. If a hypothesis test is engineered to look just at that bin, then the
observed statistic ($t_o$) will be very large, and the p-value will be very small, because
pseudo-experiments will very rarely have as big a discrepancy in the same bin.

Even for such an ad-hoc test, everything we proved still holds. It would be technically
correct that based on this a-posteriori decided test we could rule out $H_0$ with a
tiny chance of being wrong. And yet, any minimally skeptical scientist should refuse
to rule out $H_0$ based on this result. All it says, in essence, is that there is one out of the
$10^4$ bins that is very discrepant. If we had stated it like that, it wouldn't have sounded
so dramatic, but that's really what it means, and the reason is that the bin had not been
chosen a-priori, but after seeing D. If a different bin had fluctuated far from $H_0$, then
another a-posteriori test would have been quoted, which would again rule out $H_0$ with
high confidence, even if $H_0$ were true. This is what physicists refer to as "the look
elsewhere effect", or "the trials factor", implying that each bin counts as a trial with
its own chance of triggering a discovery, and the fact that there are many such trials has to
be taken into account somehow. It will become clear later that the "trials" actually are
not due to the many bins, but due to the many possible hypothesis tests one would be
interested in considering simultaneously. In other words, the "look elsewhere effect"
may better be called "look in different ways effect".

1.3.2 How to account for the trials factor - Hypertests 

Continuing the example of the previous paragraph, to see if there is a single-bin fluctuation
that is too unlikely under $H_0$, without having any prior preference for some bin,
we will come up with a statistical test that considers all possible bins on an equal footing.
It will have, like every hypothesis test, a statistic t and a p-value corresponding to
the observed statistic $t_o$. Its p-value will follow the theorem of paragraph 1.2.1; it will
therefore be interpreted as the Type-I error probability based on this test.

The hypothesis test that looks at all bins can be viewed as a hypertest, which com- 
bines all the specialized tests which focus on individual bins. These many tests are the 
many ways in which a discovery could be claimed. These many tests are the "trials". 
We will see how to construct such a hypertest. 

In our example, where the data D are partitioned in $N = 10^4$ bins, one could define
N hypothesis tests, each using one bin to define its test statistic. Of these hypothesis
tests, each can use any test statistic; they don't even have to be the same. For example,
for hypothesis tests that examine odd bins we could define the test statistic

$$t_{i \in \text{odds}} = (d_i - b_i)^2, \tag{9}$$

where $d_i$ and $b_i$ are the observed data and the expectation of $H_0$ in the bin i where each
hypothesis test focuses. For hypothesis tests that examine even bins, we could define
the test statistic

f/Gevens^W-^)'"". (10) 

No matter how we define these hypothesis tests, regardless of how numerically different
their statistics may be, for each one of these tests there is an observed statistic $t_{i,o}$,
and the corresponding p-value$_i$ in the interval [0, 1]. For each one of these N p-values,
the theorem of paragraph 1.2.1 holds: If $H_0$ is true, then each one of these tests has
probability $\alpha$ to return p-value$_i \leq \alpha$.

In this example the N tests are independent, meaning that

$$P(\text{p-value}_i \leq \alpha \mid \text{p-value}_j \leq \alpha) = P(\text{p-value}_i \leq \alpha) \quad \forall \{i,j\},$$

so the probability of at least one such hypothesis test giving a p-value$_i \leq \alpha$ is

$$P(\text{at least one test p-value} \leq \alpha) = 1 - \prod_{i=1}^{N} P(\text{p-value}_i > \alpha) = 1 - (1 - \alpha)^N. \tag{11}$$

In this case we may use the phrase "the trials factor is N", meaning that this set of
hypothesis tests consists of N statistically independent tests. If, on the contrary, all N
tests were totally correlated, meaning that

$$P(\text{p-value}_i \leq \alpha \mid \text{p-value}_j \leq \alpha) = 1 \quad \forall \{i,j\},$$

then we would have

$$P(\text{at least one test p-value} \leq \alpha) = P(\text{p-value}_i \leq \alpha)\ \forall i \;=\; \alpha \;=\; 1 - (1 - \alpha)^1. \tag{12}$$

In this case, we may say "the trials factor is 1", meaning that, although there are many
(N) hypothesis tests in the set we are considering, they count as 1 because they behave
identically. In any intermediate case of partial independence, we can define a real
number $\tilde{N}$ such that

$$P(\text{at least one test p-value} \leq \alpha) = 1 - (1 - \alpha)^{\tilde{N}} \;\Rightarrow\; \tilde{N} = \log_{(1-\alpha)}\big(1 - P(\text{at least one test p-value} \leq \alpha)\big). \tag{13}$$

We can refer to $\tilde{N}$ as the effective trials factor, which can take values between 1 and N.
The value of $\tilde{N}$ depends on N, on the way the hypothesis tests are correlated, and on
$\alpha$. It should be clear at this point that the trials factor has little to do with how many
bins there are in the data, or how many final states we consider in a search for new
physics⁵. It is really a function of the number of hypothesis tests that we employ, and
of how their answers correlate.

We just showed that a discovery algorithm that says "declare discovery if any of the
tests gives p-value $\leq \alpha$" does not have Type-I error probability equal to $\alpha$, but equal
to $1 - (1 - \alpha)^{\tilde{N}}$, which is $\geq \alpha$. That is why we cannot look at a set of hypothesis tests
(e.g. N tests, each looking at a different bin), pick the smallest p-value, and interpret
that as a Type-I error probability.
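
To get a feeling for the numbers, the two relations above can be evaluated directly. The snippet is a minimal sketch; the function names `global_type1` and `effective_trials` are hypothetical, and the inputs are arbitrary illustrative values.

```python
import math


def global_type1(alpha, n_independent):
    """P(at least one of n independent tests returns p-value <= alpha) under H0."""
    return 1.0 - (1.0 - alpha) ** n_independent


def effective_trials(alpha, p_any):
    """Effective trials factor: logarithm of (1 - p_any) in base (1 - alpha)."""
    return math.log(1.0 - p_any) / math.log(1.0 - alpha)


alpha = 0.01
p_any = global_type1(alpha, 100)   # about 0.634 for 100 independent tests
print(p_any)
print(effective_trials(alpha, p_any))   # recovers 100 for independent tests
print(effective_trials(alpha, alpha))   # totally correlated tests: 1
```

With 100 independent tests and $\alpha = 0.01$, the chance that at least one test "discovers" something under $H_0$ is about 63%, which is why the smallest p-value cannot be quoted at face value.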

There is a way to account for the trials factor, by defining a new hypothesis test that
is sensitive to the union of the features that each of the N tests is sensitive to, and has
a p-value which can be interpreted as a Type-I error probability. This new test will be
combining N hypothesis tests, and use as statistic the following:

$$t = -\log\big(\min_i \{\text{p-value}_i\}\big). \tag{14}$$

In words, this new hypothesis test uses as statistic the smallest p-value$_i$. The negative
log function is used to make t increase monotonically as $\min_i\{\text{p-value}_i\}$ decreases, following
the convention that wants t to increase with increasing discrepancy. Obviously
the log function could be replaced by any other monotonically increasing function.

We refer to the new test as a hypertest, i.e. a union of many tests, because its statistic
is a p-value of some other hypothesis test from a pre-determined set of hypothesis tests.

Every hypertest has an observed statistic $t_o$ and a corresponding p-value, found
as described in paragraph 1.1. This p-value quantifies how often such a small (or
smaller) p-value would be returned by at least one of the N hypothesis tests included
in the set, under $H_0$. The p-value of this hypertest, like any p-value, obeys the theorem
of paragraph 1.2.1. The p-value of this hypertest can be interpreted as described in
paragraph 1.2.2.
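
A hypertest over single-bin tests can be sketched as follows. This is an illustration under assumptions that are not part of the text: bin counts are taken to be independent Poisson variables under $H_0$, each local test uses the Poisson upper-tail probability as its p-value, and all names (`poisson`, `poisson_tail`, `hypertest_statistic`, `hypertest_p_value`) are hypothetical.

```python
import math
import random
from functools import lru_cache


def poisson(mean, rng):
    """Draw one Poisson variate (Knuth's method; adequate for small means)."""
    limit, k, prod = math.exp(-mean), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


@lru_cache(maxsize=None)
def poisson_tail(d, b):
    """Local p-value, assumed here to be P(n >= d) for n ~ Poisson(b)."""
    term = math.exp(-b + d * math.log(b) - math.lgamma(d + 1))
    total, n = 0.0, d
    while term > 1e-17 * total or n <= b:
        total += term
        n += 1
        term *= b / n
    return min(total, 1.0)


def hypertest_statistic(data, background):
    """t = -log(min_i p-value_i) over all single-bin tests."""
    return -math.log(min(poisson_tail(d, b) for d, b in zip(data, background)))


def hypertest_p_value(data, background, n_pseudo=1000, seed=11):
    """Estimate P(t >= t_o | H0) for the hypertest with pseudo-experiments."""
    rng = random.Random(seed)
    t_obs = hypertest_statistic(data, background)
    hits = sum(
        hypertest_statistic([poisson(b, rng) for b in background], background) >= t_obs
        for _ in range(n_pseudo)
    )
    return hits / n_pseudo


background = [5.0] * 20
data = [5] * 20
data[7] = 13                       # one bin fluctuates upward
local = min(poisson_tail(d, b) for d, b in zip(data, background))
global_p = hypertest_p_value(data, background)
print(local, global_p)
```

In this toy setup the smallest local p-value is about 0.002, while the hypertest p-value comes out roughly twenty times larger, reflecting the 20 independent single-bin tests that were combined.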

1.3.3 Final remarks on the definition of hypertests 

In paragraph 1.3.2 we gave a prescription to correctly consider simultaneously a set
of hypothesis tests, by defining a hypertest that takes into account the trials factor,
and returns a p-value that can be correctly interpreted as a Type-I error probability.
The obvious question is which hypothesis tests to include in the set used to define the
hypertest.

⁵More bins and more final states allow one to devise more hypothesis tests, but one doesn't have to.

There is no unique answer. By including more (independent) hypothesis tests in the
set, the hypertest gains sensitivity to more features. That can be desirable, especially
when we have no prior expectation of how D may differ from $H_0$. The price one pays is
that the effective trials factor ($\tilde{N}$) increases, so the power of the test decreases, namely
it would take more signal to obtain the same p-value from the hypertest.

If we knew somehow that D would differ from $H_0$ in a specific bin, there would be
no need to get distracted by looking in any other bin.

A reasonable strategy, which is adopted also by the BumpHunter, is to specify a
set of hypothesis tests which cover a large family of similar features. For example, the
BumpHunter, as we will see, is a hypertest based on the set of hypothesis tests that
look for bumps of various widths in various locations of the spectrum. The interpretation
of such a test is rather simple. If the p-value is not small enough, we conclude that
there is no significant bump of any width, at any location.

One final remark is that a hypertest A may be included in the set of hypothesis tests
used by a hypertest B. That doesn't make B a hyper-hypertest or something. Both B and
A are hypertests, because their p-values are the result of considering simultaneously the
p-values of a set of hypothesis tests (or hypertests). It is also trivial to show that if a
hypertest A contains in its set just one hypothesis test (or hypertest) B, the p-value of
A is identical to the p-value of B, so the distinction between simple hypothesis test and
hypertest gets lost in the trivial case.

2 The BumpHunter 

The BumpHunter scans the data (D) using a window of varying width, and keeps
the window with biggest excess of data compared to the background ($H_0$). This test
is designed to be sensitive to local excesses of data. The same treatment is given to
pseudo-data sampled from $H_0$, and the p-value is estimated as described in paragraph
1.1.

In the language of paragraphs 1.3.2 and 1.3.3, the BumpHunter is a hypertest that
combines hypothesis tests which focus on bumps of various widths at various positions
of the spectrum, taking the trials factor into account.
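
The scanning idea can be sketched as below. This is a simplified illustration and not the implementation described in this paper or used in the references: it assumes binned data with Poisson-distributed window counts, uses the Poisson upper-tail probability as the local p-value of each window, and ignores sidebands entirely. The names `poisson_tail` and `most_significant_window` are hypothetical.

```python
import math


def poisson_tail(d, b):
    """Assumed local p-value: P(n >= d) for n ~ Poisson(b)."""
    term = math.exp(-b + d * math.log(b) - math.lgamma(d + 1))
    total, n = 0.0, d
    while term > 1e-17 * total or n <= b:
        total += term
        n += 1
        term *= b / n
    return min(total, 1.0)


def most_significant_window(data, background, min_width=1, max_width=None):
    """Scan all windows; return (smallest local p-value, first bin, width in bins)."""
    n_bins = len(data)
    max_width = max_width or n_bins // 2
    best = (1.0, 0, min_width)
    for width in range(min_width, max_width + 1):
        for start in range(n_bins - width + 1):
            d_c = sum(data[start:start + width])
            b_c = sum(background[start:start + width])
            # Only excesses are interesting; deficits get local p-value 1.
            p_local = poisson_tail(d_c, b_c) if d_c > b_c else 1.0
            if p_local < best[0]:
                best = (p_local, start, width)
    return best


background = [10.0] * 40
data = [10] * 40
data[20:23] = [17, 19, 16]         # a three-bin excess
p_min, start, width = most_significant_window(data, background, min_width=3, max_width=5)
statistic = -math.log(p_min)       # hypertest statistic built from the smallest local p-value
print(p_min, start, width, statistic)
```

The global p-value would then follow by applying the same scan to pseudo-data generated from the background and counting how often the statistic is at least as large as the observed one.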

It will become clear that some choices have been made in this implementation
of the BumpHunter which could be different. For example, one may use different
sideband definitions, or may search for bumps within some width range. As explained
in paragraph 1.3.3, such choices are essentially arbitrary. They are made based on what
we wish the interpretation of the result to be.

This version of the BumpHunter operates on data that are binned in some a-priori
fixed set of bins. In the limit of infinitesimally narrow bins, the arbitrariness of
the binning choice is removed. If the bins are not infinitesimally small, then their size
limits the narrowest bump that one may be sensitive to. In most applications there is
a natural limit to how narrow a bump can be. For example, in [1] the limit reflects
the finite detector resolution. Practically, one can have very good performance using
bins of finite width. In the case of the Banff Challenge, the information is given that
the signal follows a Gaussian distribution with $\sigma = 0.03$, so we define 40 equal bins
between 0 and 1, resulting in bin size 0.025.
Given some data D and some background hypothesis $H_0$, the following steps are
followed to obtain the test statistic (t) of the BumpHunter:

1. Set the width of the central⁶ window $W_C$. In this implementation, where the
data are binned, $W_C$ is an integer which specifies how many consecutive bins
to include in the central window. This width is allowed to vary between some
values. In [1], where the potential signal is of unknown width, $W_C$ is allowed
to range from 1 to $\lfloor N/2 \rfloor$, where N is the total number of bins from the lowest
observed mass to the highest. To address the Banff Challenge [3], where the
signal is a Gaussian of known $\sigma = 0.03$, we constrain $W_C$ between 3 and 5 bins,
which fit roughly 68% to 95% of this Gaussian signal.

2. Set the width of each sideband. Sidebands are used, optionally, if one wishes to
impose quality criteria ensuring that the BumpHunter will focus on excesses
surrounded by non-discrepant regions. In [1] such sidebands were used, and
their size (in number of bins) was set to $\max\{1, \lfloor W_C/2 \rfloor\}$. To address the Banff
Challenge, we do not use any sidebands, in the interest of speed, and because
there is some risk associated with using sidebands when $W_C$ is constrained to
small values; this risk is illustrated in paragraph 4.3. In the following steps
we will describe how sidebands are used, because they constitute part of the
BumpHunter algorithm, even though in the Banff Challenge they are not used.

3. Set the position of the central window, which will range from the lowest to the
highest observed value⁷.

4. Count the data ($d_C$) and background ($b_C$) in the central window. Obviously $d_C$
is an integer and $b_C$ is a real number, representing the expectation value, according
to $H_0$, in the central window. Similarly, count the data ($d_L$, $d_R$) and
the background ($b_L$, $b_R$) in the left and right sideband (subscript "L" and "R"
respectively).

5. In this step, which is at the heart of the BumpHunter, we will make a connection
to what was said in paragraph 1.3.2. We will define the test statistic
t of each one of the hypothesis tests that are combined in the BumpHunter
hypertest. Each local hypothesis test examines the presence of a bump at the
location where we are currently placing the central window as we scan the spectrum.
Each such hypothesis test has its statistic t, which has an observed value $t_o$
coming from comparing the data D to $H_0$, resulting in a p-value. The smallest of
these p-values will be used in step 8 to define the BumpHunter test statistic,
according to paragraph 1.3.2.

Given the six numbers $d_{\{L,C,R\}}$ and $b_{\{L,C,R\}}$, we define the following test statistic
t for the hypothesis test which focuses on the current window and sidebands:

$$t = \begin{cases} 0 & \text{if } d_C \leq b_C \text{ or } \mathcal{P}(d_L, b_L) \leq 10^{-3} \text{ or } \mathcal{P}(d_R, b_R) \leq 10^{-3}, \\ f(d_C - b_C) & \text{otherwise.} \end{cases} \tag{15}$$

⁶"Central window" is the window where excess of data is checked for. The word "central" is used to
distinguish that window from its left and right sideband.

⁷Dijet mass in the case of [1], or x in the case of the Banff Challenge.

In this definition, / can be any positive, monotonically increasing function, such 

as {de~be)^ or {de-beY'^. Also, 

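$$\mathcal{P}(d, b) = \begin{cases} \sum_{n=d}^{\infty} \frac{b^n}{n!} e^{-b} & \text{if } d \geq b, \\ \sum_{n=0}^{d} \frac{b^n}{n!} e^{-b} & \text{if } d < b. \end{cases} \tag{16}$$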

Ignoring the sidebands is equivalent to using 0 instead of $10^{-3}$ in eq. 15. The
definition⁸ of eq. 15 was carefully designed to have the following characteristics,
which make it meaningful and practical:

• $t \geq 0$.

• $t = 0$, i.e. the discrepancy is characterized as maximally uninteresting, when
the data, where the particular hypothesis test focuses and t is computed, do
not meet the following criteria which a bump would be expected to meet:

(a) Have an excess of data in the central window, namely $d_C > b_C$. And

(b) have both sidebands consistent with the background. That is where the
two $\mathcal{P}(d_X, b_X)$ with $X = \{L, R\}$ are employed. Each one of these is the
p-value of a hypothesis test that focuses on just the left or right sideband,
and uses as test statistic $t = |d_X - b_X|$, or something similar that increases
monotonically with the difference between data and background in each
sideband. By requiring $\mathcal{P}(d_X, b_X)$ to be greater than $10^{-3}$, we require that
$H_0$ can not be excluded, based on event counts in the sideband, with less
than $10^{-3}$ probability of being wrong. The value $10^{-3}$ is arbitrary, and can
be set higher or lower to tighten or relax, respectively, the good sidebands
requirement.

• The p-value of this hypothesis test is analytically calculable directly from
$d_{\{L,C,R\}}$ and $b_{\{L,C,R\}}$, without even having to calculate $t$ or $t_o$! We will soon
explain how. This remarkable property allows the BUMPHUNTER statistic
to be computed quickly, without needing pseudo-experiments to estimate
the p-value of each local hypothesis test that it incorporates.

The p-value is computed as follows. We have the observed events $d_{Co}$, $d_{Lo}$ and
$d_{Ro}$. If $d_{Co} \le b_C$, we don't have an excess, so we know that the observed statistic
$t_o$ is 0 according to eq. 15; therefore any other pseudo-experiment would
have $t \ge t_o$, and the p-value is 1. The same is true, for the same reason, if
$\mathcal{P}(d_{Lo},b_L) \le 10^{-3}$ or $\mathcal{P}(d_{Ro},b_R) \le 10^{-3}$. When none of the above happens, $t$
is defined to increase as $d_C$ increases, since $f(d_C - b_C)$ in eq. 15 is monotonically
increasing and $b_C$ is fixed. So, when $t_o > 0$, we know that the only way
$t$ would be $\ge t_o$ is by having $d_C \ge d_{Co}$, while $d_L$ and $d_R$ remain consistent with
$b_L$ and $b_R$. To find the p-value, which by definition is $P(t \ge t_o | H_0)$, we have
to compute the probability of these three things happening simultaneously. The
conditions on $d_L$ and $d_R$ were designed to be independent from each other and
from $d_C$. This allows us to express the p-value as the product of 3 probabilities:
$P(d_C \ge d_{Co} | H_0)$, $P(\mathcal{P}(d_L,b_L) > 10^{-3} | H_0)$, and $P(\mathcal{P}(d_R,b_R) > 10^{-3} | H_0)$. The
first probability is, by definition, $\mathcal{P}(d_{Co},b_C)$. The second and third probabilities
are equal to $(1 - 10^{-3})$, because of the theorem of paragraph 1.2.1 and because
$\mathcal{P}(d_L,b_L)$ and $\mathcal{P}(d_R,b_R)$ are p-values. Putting it all together, we have:

$$ \text{p-value} = \begin{cases} 1 & \text{if } d_{Co} \le b_C \text{ or } \mathcal{P}(d_{Lo},b_L) \le 10^{-3} \text{ or } \mathcal{P}(d_{Ro},b_R) \le 10^{-3}, \\ \mathcal{P}(d_{Co},b_C)\,(1 - 10^{-3})^2 & \text{otherwise.} \end{cases} \qquad (17) $$

The term $(1 - 10^{-3})^2$ is very close to 1, but even if it wasn't, it could be ignored
because it is common to all local hypothesis tests; therefore it affects neither
which p-value will be the smallest (see step 8), nor the BUMPHUNTER p-value.

[Footnote] As an aside, in footnote 3 it was mentioned that paragraph 2 would illustrate an example of
a test statistic which doesn't follow a continuous distribution. Indeed, the test statistic $t$ of eq. 15
is discontinuous at 0. Due to the condition which may set $t$ to 0 in some cases, the PDF of $t$ contains
a peak at 0, which could be formulated as a Dirac $\delta(t - 0)$ multiplied by the probability for $t$ to be 0.

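In summary, we have shown that the p-value of eq. 17 depends on three $\mathcal{P}(d,b)$
values, which are analytically calculable quantities, using the well-known function
$\Gamma(d) = \int_0^{\infty} t^{d-1} e^{-t}\,dt$ and its normalized lower incomplete version $\Gamma(d,b)$, which is
also tabulated in standard computational code libraries, like the ROOT
TMath class [6]. The useful relationship that allows this computation is:

$$ \sum_{n=d}^{\infty} \frac{b^n}{n!} e^{-b} = \Gamma(d,b) \equiv \frac{1}{\Gamma(d)} \int_0^{b} t^{d-1} e^{-t}\,dt, \qquad (18) $$

from which it follows that:

$$ \mathcal{P}(d,b) = \begin{cases} \Gamma(d,b) & \text{if } d \ge b, \\ 1 - \Gamma(d+1,b) & \text{if } d < b. \end{cases} \qquad (19) $$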



6. Shift the central window, and its sidebands, by a number of bins, and repeat step
5, namely compute the p-value of the local hypothesis test that focuses on that
new location. In principle, the bins could be infinitesimally narrow, and the
translation could be in infinitesimally small steps, to include in the BUMPHUNTER
every possible bump candidate (or, equivalently, every possible hypothesis test
focusing on a local mass range). However, in practice there are computational
limitations. Hypothesis tests which focus on roughly the same mass range are
highly correlated. By adding more highly correlated tests not much new information
is gained and the effective trials factor $N$ doesn't increase much (see eq. 13),
but it takes time to compute the p-values of all these tests. For this reason, in the
implementation of the BUMPHUNTER used in [1] and in the Banff Challenge we
use

step size = $\max\{1, \lfloor W_C/2 \rfloor\}$.

In this way we still consider bump candidates which overlap significantly, but
we avoid spending time to consider almost identical bump candidates.

[Footnote] The equality of the second and third probabilities to $(1 - 10^{-3})$ is only approximate, due
to $d_X$ being integer. It is, however, a very good approximation. Due to $d_X$ taking discrete values, so
does $\mathcal{P}(d_X,b_X)$. For example, if $b_X = 1.5$, then to have $\mathcal{P}(d_X,1.5) > 10^{-3}$, $d_X$ has to be $< 7$,
and that has probability $\sum_{n=0}^{6} \frac{1.5^n}{n!} e^{-1.5} = 0.999074$ instead of 0.999. If $b_X = 0.001$,
then the same probability is $\sum_{n=0}^{0} \frac{0.001^n}{n!} e^{-0.001} = 0.9990005$, and for large values of $b_X$
the approximation becomes better because the discreteness of $d_X$ becomes negligible.

7. Repeat the above steps for all desired values of $W_C$, as they were described in step
1. For every choice of $W_C$ and every location of the central window, compute the
corresponding p-value as in eq. 17.

8. In this last step, the BUMPHUNTER test statistic $t$ is calculated, according to
paragraph 1.3.2:

$$ t = -\log \text{p-value}^{\min}, \qquad (20) $$

where p-value$^{\min}$ is the smallest of all p-values found in the previous steps.
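The local p-value of eq. 17 and the scan of steps 5 to 8 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in [1]: the function names are invented here, the Poisson tails of eq. 16 are summed directly instead of calling an incomplete gamma function, the background in every window is assumed to be strictly positive, and the scan ignores the sidebands (as in the tuning used for the Banff Challenge) and steps by half a window width.

```python
import math


def poisson_pvalue(d, b):
    """Eq. 16: P(n >= d) if d >= b, else P(n <= d), for n ~ Poisson(b), b > 0."""
    def pmf(n):
        return math.exp(-b + n * math.log(b) - math.lgamma(n + 1))

    if d >= b:
        # Sum the upper tail term by term; this avoids the cancellation in 1 - CDF.
        n, term, total = d, pmf(d), 0.0
        while term > 1e-17 * total:
            total += term
            n += 1
            term *= b / n
        return min(total, 1.0)
    return min(sum(pmf(n) for n in range(d + 1)), 1.0)


def local_pvalue(d_c, b_c, sidebands=(), threshold=1e-3):
    """Eq. 17. `sidebands` holds (d, b) pairs; an empty sequence ignores them."""
    if d_c <= b_c:
        return 1.0  # no excess in the central window
    if any(poisson_pvalue(d, b) <= threshold for d, b in sidebands):
        return 1.0  # a sideband is inconsistent with the background
    return poisson_pvalue(d_c, b_c) * (1.0 - threshold) ** len(sidebands)


def bumphunter_statistic(data, bkg, widths=(3, 4, 5)):
    """Steps 5-8 without sidebands. Returns (t, (first_bin, width))."""
    p_min, best = 1.0, None
    for w in widths:
        step = max(1, w // 2)  # assumed step size: half the window width
        for start in range(0, len(data) - w + 1, step):
            p = local_pvalue(sum(data[start:start + w]), sum(bkg[start:start + w]))
            if p < p_min:
                p_min, best = p, (start, w)
    # Guard against a p-value that underflows to zero in double precision.
    return -math.log(max(p_min, 1e-300)), best
```

As a check on the sideband footnote, `poisson_pvalue(0, 0.001)` returns exp(-0.001), about 0.9990005, and for a background of 1.5 the value stays above 0.001 for d = 6 but falls below it for d = 7.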



2.1 The background and pseudo-data 

Like in all hypothesis tests (e.g. $\chi^2$, KS etc.), in the BUMPHUNTER the $H_0$ is an input.
The BumpHunter uses $H_0$, and its p-value depends on it, but it doesn't define $H_0$.
Depending on how the analyst defines $H_0$, the result of the BUMPHUNTER, or of
any other hypothesis test, will have a different interpretation.

In particle physics, $H_0$ may come from Monte Carlo (MC) simulation, representing
typically the Standard Model prediction. Then, everything we have discussed so far
applies. The MC-based background is used

1. to compare $D$ to $H_0$, thus obtaining the observed BUMPHUNTER statistic $t_o$,

2. to generate pseudo-data according to $H_0$ multiple times,

3. to obtain the BUMPHUNTER statistic $t$ by comparing each pseudo-data spectrum
to $H_0$.

Then the BUMPHUNTER p-value is estimated, according to paragraph 1.1.

In some cases, it is well-motivated to formulate $H_0$ as a function of $D$, instead of
using MC. Specifically, in [1] and in the Banff Challenge, the background is not
independent of $D$. It is obtained by fitting a function to $D$. In the case of the Banff Challenge
we have the information that the background should follow an exponential spectrum

$$ B(x) = A e^{-Cx}. \qquad (21) $$

In the case of [1], studies showed that there is a more complicated functional form
which can fit the Standard Model prediction, but couldn't fit a spectrum with a
resonance. One can define as $H_0$ the result of fitting this functional form to the data $D$. This
definition of the null hypothesis may be called "smooth background hypothesis".

When $H_0$ depends on $D$, it is necessary to compute $H_0$ (i.e. by re-fitting) not only for
the actual data $D$, but also for every pseudo-experiment that will be used to estimate the
p-value. Otherwise $H_0$ is not consistently defined, which means that the two distributions
which theorem 1.2.1 requires to be identical are not, thus the p-value is not interpretable
as a Type-1 error probability.
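The need to re-fit in every pseudo-experiment can be illustrated with a short Python sketch. It is not the author's code: `fit_h0` and `statistic` are hypothetical callables standing for the background fit and for the test statistic, and the flat fit, the toy statistic and the simple Poisson sampler below are stand-ins chosen only to keep the example self-contained.

```python
import math
import random


def sample_poisson(mean, rng):
    """Knuth's multiplication method; adequate for the modest means of a sketch."""
    limit, k, prod = math.exp(-mean), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def count_more_discrepant(data, fit_h0, statistic, n_pseudo, rng):
    """Number of pseudo-experiments whose statistic is at least the observed one."""
    bkg_obs = fit_h0(data)            # H0 obtained from the actual data
    t_obs = statistic(data, bkg_obs)
    n_ge = 0
    for _ in range(n_pseudo):
        pseudo = [sample_poisson(b, rng) for b in bkg_obs]
        bkg_pseudo = fit_h0(pseudo)   # H0 is re-derived for each pseudo-dataset
        if statistic(pseudo, bkg_pseudo) >= t_obs:
            n_ge += 1
    return n_ge


# Toy stand-ins (assumptions): a flat background fit and a largest-excess statistic.
def flat_fit(data):
    mean = sum(data) / len(data)
    return [mean] * len(data)


def max_excess(data, bkg):
    return max((d - b) / math.sqrt(b) for d, b in zip(data, bkg))


k = count_more_discrepant([10] * 19 + [40], flat_fit, max_excess, 50, random.Random(1))
```

The ratio of `k` to the number of pseudo-experiments is the most likely p-value; in practice `statistic` would be the BUMPHUNTER statistic and `fit_h0` the fit of eq. 21.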

2.1.1 Fitting by omitting anomalies 

When $H_0$ is computed by fitting $D$ there is the concern that, if a bump actually exists,
it will influence the fit. Naturally, the fitted background will try to accommodate part
of the signal, even if it doesn't have the flexibility to fully do so. That can obscure the
signal, and cause the fit to not describe the data even where they don't contain signal.
Fig. 1 shows such an example.

An alternative is to define $H_0$ as the spectrum obtained by fitting the data, after
omitting the window whose omission improves the fit, in a pre-determined, algorithmic way. The
algorithm used in the Banff Challenge is to try the fit after omitting various windows,
similar to the way the BUMPHUNTER scans the spectrum (paragraph 2). The windows
that are omitted have size between 3 and 5 bins, corresponding to the width of potential
signal, and they are considered for exclusion only if they contain an excess of data. If
after the omission of some window the $\chi^2$ test p-value becomes greater than 0.1, then
we consider the fit good enough and we stop looking for other windows to possibly
omit from the fit. If the fit is not made better than that after the omission of any
window, then we keep the fit which gave the greatest p-value, even if it was less
than 0.1. An example of this algorithm in action is shown in Fig. 1, where the window
with the bump is automatically excluded, resulting in a much better fit of the rest of the
spectrum. The same algorithm, obviously, is used each time we fit pseudo-data.
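The window-omission logic described above can be sketched as follows. This is a hedged illustration only: `fit` is a hypothetical callable that returns the fitted background and the $\chi^2$ test p-value, the excess of each candidate window is judged against the fit of the whole spectrum (an assumption, the text does not say against what), and the toy fit used in the demonstration is a flat background with a made-up goodness-of-fit proxy rather than a real $\chi^2$ p-value.

```python
import math


def fit_omitting_anomaly(data, fit, widths=(3, 4, 5), good_enough=0.1):
    """`fit(data, omit)` returns (background, p_value); `omit` is None or (first, stop).

    Returns (background, p_value, omitted_window)."""
    full_bkg, full_p = fit(data, None)
    best = (full_bkg, full_p, None)
    if full_p > good_enough:
        return best
    for w in widths:
        for start in range(len(data) - w + 1):
            stop = start + w
            if sum(data[start:stop]) <= sum(full_bkg[start:stop]):
                continue  # only windows containing an excess are candidates
            bkg, p = fit(data, (start, stop))
            if p > best[1]:
                best = (bkg, p, (start, stop))
            if p > good_enough:
                return best  # fit is good enough, stop looking
    return best  # otherwise keep the fit with the greatest p-value


def toy_flat_fit(data, omit):
    """Stand-in fit: flat background; exp(-chi2/2) is NOT a real chi-square p-value."""
    kept = [d for i, d in enumerate(data) if omit is None or not omit[0] <= i < omit[1]]
    mean = sum(kept) / len(kept)
    chi2 = sum((d - mean) ** 2 / mean for d in kept)
    return [mean] * len(data), math.exp(-chi2 / 2)


spectrum = [10] * 20
spectrum[8:11] = [30, 30, 30]
background, p_value, omitted = fit_omitting_anomaly(spectrum, toy_flat_fit)
```

In this toy spectrum the three-bin bump is the window that ends up omitted, and the fitted background returns to the level of the remaining bins.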

The advantage of omitting the most discrepant region is that it makes the bump more
pronounced, as one sees in Fig. 1. Also, if the goal of the fit is to estimate the background
parameters, e.g. the value of $A$ in eq. 21, then this allows for the fit to find the right
value of $A$ without bias caused by the signal.

Besides these advantages, nothing would be wrong about the results of the BUMPHUNTER
even if one didn't follow this fit procedure. If we define $H_0$ as the result of fitting the
whole spectrum, then the BUMPHUNTER (and any other test) returns the right p-value
that reflects this definition. If the p-value indicates a significant discrepancy between
$D$ and $H_0$, it is clear what $H_0$ means and what the interpretation is. In other words,
the BumpHunter (like any test) operates with the input $D$ and $H_0$, not caring how
well-motivated $H_0$ is; that is up to the analyst.

[Footnote] However, in the specific case of the Banff Challenge this is not how we estimate $A$, because
we have the information that the signal follows a Gaussian of known width, so it is better to fit the
background of eq. 21 simultaneously with a Gaussian. The primary goal of the BUMPHUNTER is not to
estimate parameters, but to test $H_0$.
3 The Banff Challenge, problem 1

The Banff Challenge [3], Problem 1, offers an opportunity to demonstrate BumpHunter's
performance.

$H_0$ is defined as the spectrum obtained by fitting the data with eq. 21, following
the algorithm of paragraph 2.1.1. The BUMPHUNTER p-value is estimated using the
procedure of sec. 1.1, generating pseudo-experiments until we are sure (in the Bayesian
sense described in 1.1) that the p-value is smaller or greater than 0.01 with probability
> 0.999. If the p-value is estimated to be < 0.01 (with probability > 0.999), we declare
discovery; if the p-value is estimated to be > 0.01 (with probability > 0.999), then we
don't.
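The stopping rule can be made concrete with a small Python sketch. It assumes, as stated for the examples in this section, a flat prior on the p-value, so that after k out of n pseudo-experiments were at least as discrepant as the data the posterior is a Beta(k+1, n-k+1) distribution; its cumulative probability is evaluated here through the equivalent binomial sum. The function names are illustrative, not taken from the author's code.

```python
import math


def prob_pvalue_below(x, k, n):
    """Posterior P(p-value < x) for a flat prior, given k of n pseudo-experiments
    with a statistic at least as large as the observed one."""
    m = n + 1
    # P(Binomial(m, x) <= k) equals the posterior probability that p-value > x.
    above = 0.0
    for j in range(k + 1):
        log_term = (math.lgamma(m + 1) - math.lgamma(j + 1) - math.lgamma(m - j + 1)
                    + j * math.log(x) + (m - j) * math.log1p(-x))
        above += math.exp(log_term)
    return 1.0 - above


def decide(k, n, alpha=0.01, credibility=0.999):
    """Return 'discovery', 'no discovery', or 'continue' (generate more pseudo-data)."""
    below = prob_pvalue_below(alpha, k, n)
    if below > credibility:
        return "discovery"
    if 1.0 - below > credibility:
        return "no discovery"
    return "continue"
```

For instance, 6 out of 90 gives a probability of about 0.99996 that the p-value exceeds 0.01, while 0 out of 690 gives about 0.99904 that it is below 0.01; with 0 out of 680 the credibility is still just under 0.999, so more pseudo-experiments would be needed.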

Then comes the challenge of estimating the parameter $A$ of the background and
the position $E$ of the signal (if discovery was declared). We go one step further, and
estimate also the amount of signal ($D$). We do all that by fitting to the data the function

$$ f(x) = A e^{-Cx} + D \frac{1}{\sqrt{2\pi}\,0.03}\, e^{-\frac{(x-E)^2}{2 \cdot 0.03^2}}. \qquad (22) $$

This fit has free parameters $\{A,C,D,E\}$. We use the result of the BUMPHUNTER to
aid it; the initial value of $E$ is set to the position where the BUMPHUNTER located the
most significant bump.

All data are studied after binning them in 40 equal bins of $x$ between 0 and 1. (Bin
size = 0.025.) If the actual $A$ is $10^4$ the fit will return roughly $10^4/40 = 250$.

We executed the BUMPHUNTER and the subsequent 4-parameter fit on all 20000
distributions handed out with the Challenge. The results are tabulated in a separate,
long text file, with the columns:

• Dataset number (from 0 to 19999)

• Decision: 0 means "most likely estimated p-value > 0.01, thus no discovery
claim." 1 means "most likely estimated p-value < 0.01, thus discovery is
claimed."

• p-value estimate. For example, the string

0.0666667 = 6/90 P(pval>0.01)= 0.999961

condenses the following information: 90 pseudo-experiments were generated. 6
of them had a BUMPHUNTER statistic greater than the BUMPHUNTER statistic
observed in the actual data. That means that the most likely value for the p-value
is 6/90 = 0.067. According to the Bayesian posterior described in paragraph 1.1,
the p-value is greater than 0.01 with probability 0.99996. So, it is safely above
0.01, and in this case we don't declare discovery. Let's see another example:

0 = 0/690 P(pval<0.01)= 0.99904

This string means that 690 pseudo-experiments were generated, none of them
was more discrepant than the actual data, which means that the most likely
p-value is 0, and the Bayesian posterior ensures that the p-value is less than
0.01 with probability 0.99904. In this case we claim discovery.

• The next three numbers: the most likely signal position $E$, its lower 68% CI limit
and its higher 68% CI limit. This interval is obtained from MINUIT, by fitting
eq. 22 and taking the error of the parameter with TF1::GetParError [6].

• The next three numbers: same as the previous three numbers, but for parameter
$D$, after fitting eq. 22.

• The last three numbers: same as the previous three numbers, but for parameter
$A$, after fitting eq. 22.

Appendix A includes the first 100 lines of the aforementioned text file.

[Footnote] That the fit returns roughly 250 rather than $10^4$ is the result of not using the option 'I'
when fitting in ROOT [6].

[Footnote] The actual accuracy of this probability does not extend beyond the third or fourth significant
digit.

3.1 A discovery example 

As an example where we claim discovery, we present dataset 10, the first dataset where
discovery is claimed. Fig. 2 summarizes the information extracted from this dataset.

For this dataset, we estimate the most likely p-value to be 0/690 = 0. With the 690
pseudo-experiments generated, and assuming a flat prior in [0,1], we infer that the
p-value is less than 0.01 with probability about 0.99904.

The signal mean is estimated at $E = 0.664 \pm 0.018$. Similarly, $D = 0.13 \pm 0.07$, and
$A = 242 \pm 12$. It should be recalled that each bin has width 0.025, and 242/0.025 =
9680, which is comparable to what is known about $A$, i.e. that it is a random variable
around $10^4$. Similarly, 0.13/0.025 = 5.2, which is comparable to the number of events
one can identify as signal in Fig. 2(b).

3.2 A non-discovery example 

As an example where we do not claim discovery, we present dataset 0. Fig. 3 summarizes
the information extracted from this dataset.

For this dataset, we estimate the most likely p-value to be 9/10 = 0.9. Of course, this
number is not so useful, because it reflects only 10 pseudo-experiments. The useful
inference from those 10 pseudo-experiments, though, is that the p-value is greater than
0.01 with probability indistinguishably close to 100%.

3.3 Summary of datasets 

Of the 20000 datasets, we found 1819 where the most likely p-value was estimated to
be < 0.01. There are 107 datasets where it was decided to stop
producing pseudo-experiments, because we ran out of time. Of those 107 datasets, 57
have estimated p-value < 0.01, and 50 have p-value > 0.01. The reason it took too long
to conclude was that the p-value is very close to 0.01, so many trials are required to
discern, with 0.999 credibility, on which side of 0.01 the p-value is. However, of those
107 datasets where 0.999 credibility was not attained, 64 concluded with credibility
less than 0.99, 38 concluded with credibility less than 0.9, and just 2 with credibility
less than 0.5. Indicatively, these 2 datasets estimated the most likely p-value to be

Fig. 4 and Fig. 5 show 10 more examples of datasets (5 with a discovery claim and
5 without).

Fig. 6 summarizes the best fitting values of $A$, $D$ and $E$ in just those 1711 pseudo-
experiments where a discovery was claimed at the level of 0.01 Type-1 error probability.



4 Sensitivity 

4.1 The Banff Challenge sensitivity tests 

The sensitivity of the BUMPHUNTER is measured in three signal cases, as required
by the Banff Challenge. "Sensitivity" means the probability of observing a p-value <
0.01 in the presence of a specific amount and kind of signal. In all signal cases, the
signal is injected in the nominal background distribution, which comes from integrating
$10^4 e^{-10x}$ in each $x$ bin. In all cases, the signal is given by a function $D\,e^{-\frac{(x-E)^2}{2 \cdot 0.03^2}}$.

In the first test, we have $\{D,E\} = \{1010, 0.1\}$. Integrating the signal function in
$x \in [0,1]$, we have a total of 75.9 events. Out of 300 pseudo-experiments, generated
from the distribution in Fig. 7(a), the BUMPHUNTER p-value was less than 0.01 in 64
pseudo-experiments. That implies discovery probability of about 21.3%.
The results of the second and third test are summarized in Table 1.
Fig. 7 summarizes the expected distributions in the three sensitivity tests, and shows
an example of pseudo-data from each expected spectrum.
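The total signal quoted for each test follows from integrating the Gaussian signal function over $x \in [0,1]$. A short Python check, with a helper name chosen here purely for illustration, reproduces the three totals listed in Table 1:

```python
import math


def gaussian_signal_events(D, E, sigma=0.03, lo=0.0, hi=1.0):
    """Integral of D * exp(-(x - E)^2 / (2 sigma^2)) over [lo, hi]."""
    def cdf(x):
        return 0.5 * (1.0 + math.erf((x - E) / (sigma * math.sqrt(2.0))))

    return D * sigma * math.sqrt(2.0 * math.pi) * (cdf(hi) - cdf(lo))


totals = [gaussian_signal_events(D, E) for D, E in [(1010, 0.1), (137, 0.5), (18, 0.9)]]
```

The small loss relative to $D \sigma \sqrt{2\pi}$ in the first and third case comes from the part of the Gaussian tail that falls outside the [0, 1] range.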



4.2 Comparison to the case of known signal shape and position 

For the sake of comparison, what would our sensitivity be if we knew the location of
the signal and its exact shape, and we only ignored its amount (which is proportional
to $D$)? In that case, obviously, the BUMPHUNTER would be unnecessary; why look at
many places, and pay the penalty of the trials factor, when knowing exactly where the
signal is?

In that ideal case, we could compare the null hypothesis to the hypothesis which
includes the specific signal and best fits the data. We could define as test statistic the
"log likelihood ratio":

$$ t = \log \frac{L(\text{Data}|\hat{D})}{L(\text{Data}|D=0)}, \qquad (23) $$

where $L(\text{Data}|D)$ is the probability of observing the data, bin by bin, assuming the
given signal shape with parameter $D$, and $\hat{D}$ is the value of $D$ which maximizes this
likelihood.
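A minimal sketch of the statistic of eq. 23 for binned data is given below. It assumes independent Poisson counts in each bin, a strictly positive background, a known signal shape `shape` (expected signal events per bin for D = 1), and a signal amount restricted to 0 ≤ D ≤ `d_max`; these choices, the function names, and the ternary search used for the maximization are illustrative and not taken from the paper.

```python
import math


def log_likelihood(data, bkg, shape, D):
    """Binned Poisson log-likelihood, up to a constant that does not depend on D."""
    return sum(d * math.log(b + D * s) - (b + D * s)
               for d, b, s in zip(data, bkg, shape))


def llr_statistic(data, bkg, shape, d_max, iterations=200):
    """Return (t, D_hat) with t = log L(D_hat) - log L(D = 0)."""
    lo, hi = 0.0, d_max
    for _ in range(iterations):  # ternary search; the log-likelihood is concave in D
        m1, m2 = lo + (hi - lo) / 3.0, hi - (hi - lo) / 3.0
        if log_likelihood(data, bkg, shape, m1) < log_likelihood(data, bkg, shape, m2):
            lo = m1
        else:
            hi = m2
    d_hat = 0.5 * (lo + hi)
    t = log_likelihood(data, bkg, shape, d_hat) - log_likelihood(data, bkg, shape, 0.0)
    return t, d_hat


bkg = [100.0] * 10
shape = [0, 0, 0, 1, 2, 1, 0, 0, 0, 0]
data = [b + 20 * s for b, s in zip(bkg, shape)]
t, d_hat = llr_statistic(data, bkg, shape, d_max=100.0)
```

As with any test statistic, the p-value of this test would still be obtained by comparing the observed `t` to its distribution in pseudo-data generated under the null hypothesis.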

Running this hypothesis test, we found a p-value < 0.01 in 175 of the 300
pseudo-experiments of the first test (probability about 58%). In the second test the
success rate was 173/300 ≈ 58%. In the third test, the result was 112/300 ≈ 37%. These
numbers are added to Table 1 as an extra column.

E     D      Total signal   BumpHunter P(p-value < 0.01)   Likelihood ratio test
0.1   1010   75.9           64/300 ≈ 21.3%                 175/300 ≈ 58%
0.5   137    10.3           87/300 ≈ 29.0%                 173/300 ≈ 58%
0.9   18     1.35           32/300 ≈ 10.7%                 112/300 ≈ 37%

Table 1: Summary of BUMPHUNTER sensitivity to the three tests posed by the Banff
Challenge. The last column shows, for comparison, the results of the targeted
likelihood ratio test of paragraph 4.2.

Comparing these success rates to the ones mentioned in paragraph 4.1, one
confirms that the BUMPHUNTER is less sensitive than a test to which the location and
shape of the signal have been disclosed. This lower sensitivity is a consequence of
the greater trials factor in the BUMPHUNTER, as expected from the discussion in
paragraph 1.3.2. Nevertheless, in research one doesn't know in advance what one is going
to discover, unless some confirmation is sought instead of discovery. Between the less
sensitive BUMPHUNTER, which covers a large range of possibilities, and an arbitrary
hypothesis test that is sensitive to just one arbitrary signal and insensitive to almost
everything else, the BUMPHUNTER seems to be a better choice.

4.3 Sensitivity of different tunings, without re-fitting 

In this section we will compare the sensitivity of the BUMPHUNTER when it is tuned
in the following ways:

1. Not using sidebands criteria, and trying all window sizes, as described in
paragraph 2.

2. Not using sidebands criteria, and constraining the window size between 3 and
5 bins. This is the tuning used to address the Banff Challenge, as described in
paragraph 2.

3. Using sidebands criteria, and trying all window sizes, as described in
paragraph 2. This tuning was used in [1].

4. Using sidebands criteria, and constraining the window size between 3 and 5 bins.

In this paragraph, $H_0$ is not obtained by re-fitting eq. 21 to the data (or pseudo-data),
but is always the same spectrum, which corresponds to $10^4 e^{-10x}$.

The sensitivities of the various BUMPHUNTER tunings are compared to that of the
targeted test of paragraph 4.2. The sensitivity of Pearson's traditional $\chi^2$ test is also shown,
where the test statistic is that of eq. 1.

Fig. 8 shows the probability of observing p-value < 0.01 in three cases of signal,
as a function of the expected number of signal events. The three signal cases used
correspond to Gaussians of $\sigma = 0.03$ and means {0.1, 0.5, 0.9}, according to the Banff
sensitivity tests discussed in 4.1 and 4.2.
In Fig. 8 we see, as expected, that the BUMPHUNTER is always less sensitive than
the targeted test. It is much more sensitive, though, than a simple $\chi^2$ test, except when
the signal is at 0.9.

In Fig. 8 it may be surprising that the BUMPHUNTER sensitivity does not reach
100% asymptotically when the sidebands criteria are taken into account and the width
of the central window is constrained. This is the risk discussed in paragraph 2, step 2.
The explanation is simple. When the signal increases a lot, and the central window is
not allowed to become wider, the sidebands start accumulating so many signal events
that they become discrepant, so the bump candidate is often disqualified. We see that this
doesn't happen when the sidebands are ignored, or when the size of the central window
can vary freely.

One may compare the sensitivity of the BUMPHUNTER without sidebands and with
constrained width in Fig. 8 to Table 1. In Fig. 8, for the same amount of injected signal
shown in the table (i.e. 75.9, 10.3 and 1.35), the sensitivity appears higher. The
difference is that in Fig. 8 the background is known and fixed, rather than obtained by fitting
as in Table 1.

It is worth recalling here that, for any hypothesis test, sensitivity depends on the
kind of signal. The conclusions of this paragraph may not apply to different signal
shapes.

4.4 Locating the right interval 

Here we demonstrate how the BUMPHUNTER locates the position of injected signal.
We will refer to two of the BUMPHUNTER tunings of paragraph 4.3: tuning 1 (no
sidebands and unconstrained width) and tuning 2 (no sidebands and width constrained
between 3 and 5 bins). The signal injected will be Gaussian of $\sigma = 0.03$ and mean 0.5;
the results are similar at mean 0.1 and 0.9. Various amounts of signal will be tried to
show how the ability to locate the right $x$ interval progresses.

Let's first examine what intervals are located as most discrepant when there is no
signal injected on top of the background of the Banff Challenge, $10^4 e^{-10x}$. Fig. 9 shows
two examples; one with BUMPHUNTER tuning 1 and one with tuning 2. Fig. 9(c) and Fig. 9(f)
show that higher $x$ values are less likely to be included in the most discrepant interval.
The reason has to do with expecting too few events beyond $x \simeq 0.6$ (see Fig. 1). To
demonstrate that, Fig. 10 shows the same as Fig. 9(c), but for a different exponential
background function, chosen so as to expect over 100 events even in the highest $x$ bin.
Consequently, Fig. 10 shows more constant probabilities, indicating that the most interesting
window is uniformly distributed in the [0, 1] range. In Fig. 10 one can still see a
reduction of probability close to $x = 0$ and 1. These edge effects are there because the $x$
bins that are not so close to the edges have more possibilities to be included in the most
discrepant interval; they may be in its middle, or near its end. Marginal bins, however,
have fewer possibilities to be included; for the very last $x$ bin, only one way exists: the
most discrepant interval has to reach to the edge of the [0, 1] range.

Fig. 11 shows the same as Fig. 9, except that just one signal event is injected (on
average) on top of the $10^4 e^{-10x}$ background. According to Fig. 8(b), the sensitivity to
1 signal event is very low. However, in Fig. 11(c) and 11(f) one sees that this signal is
enough to give the right $x$ bins a much greater probability to be included in the most
discrepant interval. Fig. 12 shows the same, but with 10 signal events injected on
average, which makes the effects more prominent. In Fig. 12(b) one sees that the intervals
tend to have approximately the width of the injected signal. Fig. 13 shows the same,
but with 40 signal events injected on average, which means that the BUMPHUNTER has
≈100% probability to return p-value < 0.01, according to Fig. 8(b). In this case all
intervals are located at the right position, and have the right width, given the finite size
of $x$ bins which discretizes the width of the intervals returned by the BUMPHUNTER.

5 Generalizing the BumpHunter concept 

The BumpHunter is not the only hypertest one could use, as explained in paragraph
1.3.3. Understanding the logic behind the BumpHunter allows one to think of
generalizations of this idea. One such generalization is the TailHunter (paragraph 5.1).
Another is a hypertest that combines multiple distributions (paragraph 5.2). Another
hypertest, very similar to the BumpHunter, was developed previously in the H1
experiment [7], where data deficits were also considered as potential signs of new physics,
and no sideband criteria were used. The H1 hypertest, which obviously predates this
work, can be viewed a posteriori as a particular tuning of the BumpHunter.

5.1 TailHunter 

A simple hypertest, analogous to the BumpHunter, is the TailHunter, which is 
used in HI, and is also similar to the Sleuth algorithm ISHD used in imfTOlFl 

One can think of the TailHunter as a BumpHunter without sidebands, where
the right edge of every window is always at the last bin that contains data. The only
requirement that remains in the definition of $t$ (eq. 15) is to have an excess of data with
respect to the background. All tails are examined by local hypothesis tests, the smallest
p-value is used to define the statistic of the TailHunter hypertest, and the p-value of
the TailHunter is found as explained in paragraph 1.1.

Fig. 14(a) presents an example of a spectrum where the TailHunter finds p-value
less than 0.01 with credibility greater than 0.999. The spectrum is created by adding
to dataset 0 of the Banff Challenge some signal events that follow a uniform
distribution between 0 and 1, with 40 signal events expected in the whole interval. The
observed TailHunter statistic in this example is 17.8, far beyond the values obtained
in pseudo-data, shown in Fig. 14(b).
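The TailHunter statistic described above can be sketched in Python as follows. This is an illustration with invented names, assuming a strictly positive background in every tail and at least one non-empty data bin; it is not the code used for Fig. 14.

```python
import math


def poisson_upper_tail(d, b):
    """P(n >= d) for n ~ Poisson(b), b > 0, summed term by term."""
    n, total = d, 0.0
    term = math.exp(-b + d * math.log(b) - math.lgamma(d + 1))
    while term > 1e-17 * total:
        total += term
        n += 1
        term *= b / n
    return min(total, 1.0)


def tailhunter_statistic(data, bkg):
    """Return (t, first_bin) for the most significant tail with an excess."""
    last = max(i for i, d in enumerate(data) if d > 0)  # last bin that contains data
    p_min, best_start = 1.0, None
    for start in range(last + 1):
        d, b = sum(data[start:last + 1]), sum(bkg[start:last + 1])
        if d > b:  # only an excess is interesting, as in eq. 15 without sidebands
            p = poisson_upper_tail(d, b)
            if p < p_min:
                p_min, best_start = p, start
    return -math.log(max(p_min, 1e-300)), best_start


bkg = [50, 20, 8, 3, 1, 0.4, 0.1, 0.05]
data = [50, 20, 8, 3, 4, 3, 2, 0]
t, first_bin = tailhunter_statistic(data, bkg)
```

In this toy spectrum the most significant tail starts at bin 4, where 9 events are observed over an expected 1.5.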

5.2 Combining spectra 

Another hypertest (let's refer to it as mBH for "multi-BUMPHUNTER"), allows the
combination of two or more spectra to be scanned simultaneously. In some particle
physics analyses this is useful, because an exotic particle may decay in many ways
(e.g. $Z' \to e^+e^-$ and $Z' \to \mu^+\mu^-$), so the signal may populate two or more statistically
independent distributions. When we search for bumps in the mass spectrum of decay
products, e.g. in $m_{ee}$ and $m_{\mu\mu}$, all spectra should indicate an excess at roughly the
same mass, namely the mass of the new particle. The width of the signal, though, is
not expected to be the same in all distributions, since different decay products may be
measured with different experimental resolution.

[Footnote] This is not the terminology used by H1, but looking at it from the perspective of this work,
it was indeed a hypertest, taking the trials factor into account correctly.

[Footnote] Besides small technical differences, the biggest difference is that SLEUTH combined many
final states, and didn't use fixed bins. Regarding the combination of many final states, see paragraph 5.2.

One way to extend the BUMPHUNTER into mBH is the following: The BUMPHUNTER
statistic is first computed independently in each spectrum, and then the mBH statistic is
defined as the sum of all BUMPHUNTER statistics, with the extra requirement that all
spectra must have their most interesting intervals within some distance from each other.
The exact distance criterion can be adjusted. If bumps are found at different masses,
then we can characterize the mBH's finding as maximally uninteresting, by setting the
mBH statistic to 0. Finally, the mBH p-value is estimated as explained in paragraph
1.1.
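The combination rule just described is simple enough to write down directly. In the Python sketch below, each spectrum contributes its BUMPHUNTER statistic and the position of its most interesting interval; the function name and the choice of comparing the largest separation between positions to a single `max_distance` are assumptions made for illustration, since the exact distance criterion is left adjustable.

```python
import math


def mbh_statistic(results, max_distance):
    """`results` is a list of (bumphunter_statistic, bump_position), one per spectrum."""
    positions = [pos for _, pos in results]
    if max(positions) - min(positions) > max_distance:
        return 0.0  # bumps at different masses: maximally uninteresting
    return sum(t for t, _ in results)


# Two spectra with local minimum p-values 1e-3 and 1e-2 at nearby masses.
p_ee, p_mumu = 1e-3, 1e-2
combined = mbh_statistic([(-math.log(p_ee), 1000.0), (-math.log(p_mumu), 1020.0)], 50.0)
far_apart = mbh_statistic([(-math.log(p_ee), 1000.0), (-math.log(p_mumu), 1500.0)], 50.0)
```

Since each statistic is the negative logarithm of a p-value, `combined` equals the negative logarithm of the product of the two p-values.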

The mBH is highly sensitive to signals that appear simultaneously in two (or more)
spectra, because all signal significances are combined at the step where the BUMPHUNTER
statistics are summed. Obviously, the mBH described so far makes a strong assumption:
that the signal has to appear simultaneously in all examined spectra. If this is
indeed a characteristic of the signal, then mBH is more sensitive to it; otherwise it is
not a well-motivated test. As explained in paragraph 1.3.3, there is not a universally
best hypertest.

If one relaxes the extra requirement which compares the interval locations in dif- 
ferent spectra, and uses as mBH statistic the biggest BUMPHUNTER statistic instead of 
their sum, then mBH naturally reduces to the approach taken in ll2l fT0l and Q to search 
in multiple spectra without making strong assumptions. There, each spectrum is
examined independently, without checking for patterns across spectra, and without making
any attempt to combine the significance of the findings in different spectra. The
smallest p-value from all spectra is noted (this corresponds to defining the mBH statistic
as the maximum BUMPHUNTER statistic found across the examined spectra), and the
probability is estimated of seeing a p-value as small as that, or smaller, in pseudo-data
that follow $H_0$ in all distributions (and this corresponds to finding the p-value of the
mBH).

6 Conclusion 

After an introduction to hypothesis testing and the meaning of p-values, the issue of
the trials factor was illustrated, and a method to deal with it was proposed, by the
introduction of hypothesis hypertests. One such hypertest is the BUMPHUNTER, inspired
by searches for exotic phenomena in high energy physics.

The BumpHunter algorithm is presented, and its performance is demonstrated
with the opportunity of the Banff Challenge, Problem 1 [3].

Besides documenting the BUMPHUNTER (and TailHunter) algorithm in detail,
the author is open to collaborating with people who need his code. Hopefully, it will
soon be incorporated in a standard library, like RooStats [11].

I wish to thank Pekka Sinervo, Pierre Savard, Tom Junk, and Bruce Knuteson, for
our fruitful discussions.

[Footnote] Remember that the BUMPHUNTER statistic is the negative logarithm of a p-value, so the sum
of many BumpHunter statistics is the negative logarithm of a product of p-values. Adding BUMPHUNTER
statistics is equivalent to multiplying p-values.
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A First 100 lines from the Banff Challenge, problem 1 

0 0 0.9 = 9/10 P(pval>0.01)= 1 0 0 0 0 0 0 0 0 0
1 0 0.9 = 9/10 P(pval>0.01)= 1 0 0 0 0 0 0 0 0 0
2 0 0.1 = 3/30 P(pval>0.01)= 0.999746 0 0 0 0 0 0 0 0 0
3 0 1 = 10/10 P(pval>0.01)= 1 0 0 0 0 0 0 0 0 0
4 0 0.8 = 8/10 P(pval>0.01)= 1 0 0 0 0 0 0 0 0 0
5 0 0.1 = 3/30 P(pval>0.01)= 0.999746 0 0 0 0 0 0 0 0 0
6 0 0.0666667 = 4/60 P(pval>0.01)= 0.999626 0 0 0 0 0 0 0 0 0
7 0 0.4 = 4/10 P(pval>0.01)= 1 0 0 0 0 0 0 0 0 0
8 0 0.0666667 = 6/90 P(pval>0.01)= 0.999961 0 0 0 0 0 0 0 0 0
9 0 0.4 = 4/10 P(pval>0.01)= 1 0 0 0 0 0 0 0 0 0
10 1 0 = 0/690 P(pval<0.01)= 0.99904 0.663528 0.645274 0.681782 0.128468 0.061079 0.195858 242.076 230.556 253.595
11 0 0.166667 = 5/30 P(pval>0.01)= 0.999999 0 0 0 0 0 0 0 0 0
12 0 0.4 = 4/10 P(pval>0.01)= 1 0 0 0 0 0 0 0 0 0
13 0 0.6 = 6/10 P(pval>0.01)= 1 0 0 0 0 0 0 0 0 0
14 0 0.5 = 5/10 P(pval>0.01)= 1 0 0 0 0 0 0 0 0 0
15 0 0.25 = 5/20 P(pval>0.01)= 1 0 0 0 0 0 0 0 0 0
16 0 0.2 = 4/20 P(pval>0.01)= 0.999998 0 0 0 0 0 0 0 0 0
17 0 0.5 = 5/10 P(pval>0.01)= 1 0 0 0 0 0 0 0 0 0
18 0 1 = 10/10 P(pval>0.01)= 1 0 0 0 0 0 0 0 0 0
19 0 0.4 = 4/10 P(pval>0.01)= 1 0 0 0 0 0 0 0 0 0
20 0 0.2 = 2/10 P(pval>0.01)= 0.999845 0 0 0 0 0 0 0 0 0
21 0 0.8 = 8/10 P(pval>0.01)= 1 0 0 0 0 0 0 0 0 0
22 1 0.00431373 = 11/2550 P(pval<0.01)= 0.999027 0.0907464 0.0800601 0.101433 2.5333 1.78409 3.28251 236.094 221.044 251.144
23 0 0.3 = 3/10 P(pval>0.01)= 0.999997 0 0 0 0 0 0 0 0 0
24 0 0.1 = 3/30 P(pval>0.01)= 0.999746 0 0 0 0 0 0 0 0 0
25 1 0.00357143 = 7/1960 P(pval<0.01)= 0.99901 0.497455 0.488462 0.506448 0.392582 0.266279 0.518885 267.207 255.062 279.351
26 0 0.0136605 = 103/7540 P(pval>0.01)= 0.999016 0 0 0 0 0 0 0 0 0
27 0 0.0165385 = 43/2600 P(pval>0.01)= 0.999245 0 0 0 0 0 0 0 0 0
28 0 0.6 = 6/10 P(pval>0.01)= 1 0 0 0 0 0 0 0 0 0
29 0 1 = 10/10 P(pval>0.01)= 1 0 0 0 0 0 0 0 0 0
30 0 0.9 = 9/10 P(pval>0.01)= 1 0 0 0 0 0 0 0 0 0
31 0 0.4 = 4/10 P(pval>0.01)= 1 0 0 0 0 0 0 0 0 0
32 0 0.0428571 = 6/140 P(pval>0.01)= 0.999411 0 0 0 0 0 0 0 0 0
33 0 0.6 = 6/10 P(pval>0.01)= 1 0 0 0 0 0 0 0 0 0
34 0 0.5 = 5/10 P(pval>0.01)= 1 0 0 0 0 0 0 0 0 0
35 1 0 = 0/690 P(pval<0.01)= 0.99904 0.391657 0.385773 0.397541 1.0584 0.834799 1.282 297.514 284.591 310.436
36 0 0.3 = 3/10 P(pval>0.01)= 0.999997 0 0 0 0 0 0 0 0 0
37 0 0.2 = 2/10 P(pval>0.01)= 0.999845 0 0 0 0 0 0 0 0 0
38 0 0.2 = 2/10 P(pval>0.01)= 0.999845 0 0 0 0 0 0 0 0 0
39 0 0.2 = 2/10 P(pval>0.01)= 0.999845 0 0 0 0 0 0 0 0 0
40 0 0.9 = 9/10 P(pval>0.01)= 1 0 0 0 0 0 0 0 0 0
41 1 0.00855856 = 266/31080 P(pval<0.01)= 0.999893 0.543414 0.52964 0.557188 0.194656 0.105582 0.283731 220.339 209.713 230.964
42 1 0 = 0/690 P(pval<0.01)= 0.99904 0.143887 0.134607 0.153166 2.21724 1.70884 2.72564 287.405 273.936 300.875
43 0 0.6 = 6/10 P(pval>0.01)= 1 0 0 0 0 0 0 0 0 0
44 0 0.2 = 2/10 P(pval>0.01)= 0.999845 0 0 0 0 0 0 0 0 0
45 0 0.1 = 3/30 P(pval>0.01)= 0.999746 0 0 0 0 0 0 0 0 0
46 0 0.0177778 = 32/1800 P(pval>0.01)= 0.999091 0 0 0 0 0 0 0 0 0
47 0 0.7 = 7/10 P(pval>0.01)= 1 0 0 0 0 0 0 0 0 0
48 0 0.3 = 3/10 P(pval>0.01)= 0.999997 0 0 0 0 0 0 0 0 0
49 0 0.5 = 5/10 P(pval>0.01)= 1 0 0 0 0 0 0 0 0 0
50 0 0.3 = 3/10 P(pval>0.01)= 0.999997 0 0 0 0 0 0 0 0 0
51 0 0.7 = 7/10 P(pval>0.01)= 1 0 0 0 0 0 0 0 0 0
52 0 0.2 = 6/30 P(pval>0.01)= 1 0 0 0 0 0 0 0 0 0
53 0 0.3 = 3/10 P(pval>0.01)= 0.999997 0 0 0 0 0 0 0 0 0
54 0 1 = 10/10 P(pval>0.01)= 1 0 0 0 0 0 0 0 0 0
55 0 0.133333 = 4/30 P(pval>0.01)= 0.999986 0 0 0 0 0 0 0 0 0
56 0 0.2 = 2/10 P(pval>0.01)= 0.999845 0 0 0 0 0 0 0 0 0
57 0 0.05 = 5/100 P(pval>0.01)= 0.999437 0 0 0 0 0 0 0 0 0
58 0 0.4 = 4/10 P(pval>0.01)= 1 0 0 0 0 0 0 0 0 0
59 0 0.2 = 2/10 P(pval>0.01)= 0.999845 0 0 0 0 0 0 0 0 0
60 0 0.4 = 4/10 P(pval>0.01)= 1 0 0 0 0 0 0 0 0 0
61 0 0.7 = 7/10 P(pval>0.01)= 1 0 0 0 0 0 0 0 0 0
62 0 0.4 = 4/10 P(pval>0.01)= 1 0 0 0 0 0 0 0 0 0
63 0 0.15 = 3/20 P(pval>0.01)= 0.999948 0 0 0 0 0 0 0 0 0
64 0 0.6 = 6/10 P(pval>0.01)= 1 0 0 0 0 0 0 0 0 0
65 0 0.3 = 3/10 P(pval>0.01)= 0.999997 0 0 0 0 0 0 0 0 0
66 0 0.0368421 = 7/190 P(pval>0.01)= 0.999247 0 0 0 0 0 0 0 0 0
67 0 0.25 = 5/20 P(pval>0.01)= 1 0 0 0 0 0 0 0 0 0
68 0 1 = 10/10 P(pval>0.01)= 1 0 0 0 0 0 0 0 0 0
69 0 0.4 = 4/10 P(pval>0.01)= 1 0 0 0 0 0 0 0 0 0
70 0 0.3 = 3/10 P(pval>0.01)= 0.999997 0 0 0 0 0 0 0 0 0
71 0 0.7 = 7/10 P(pval>0.01)= 1 0 0 0 0 0 0 0 0 0
72 0 0.0571429 = 4/70 P(pval>0.01)= 0.999247 0 0 0 0 0 0 0 0 0
73 1 0.00431373 = 11/2550 P(pval<0.01)= 0.999027 0.507966 0.496324 0.519609 0.391159 0.262968 0.51935 275.858 263.193 288.522
74 0 0.4 = 4/10 P(pval>0.01)= 1 0 0 0 0 0 0 0 0 0
75 0 0.1 = 4/40 P(pval>0.01)= 0.999944 0 0 0 0 0 0 0 0 0
76 0 0.2 = 2/10 P(pval>0.01)= 0.999845 0 0 0 0 0 0 0 0 0
77 0 0.2 = 2/10 P(pval>0.01)= 0.999845 0 0 0 0 0 0 0 0 0
78 0 0.5 = 5/10 P(pval>0.01)= 1 0 0 0 0 0 0 0 0 0
79 0 0.4 = 4/10 P(pval>0.01)= 1 0 0 0 0 0 0 0 0 0
80 0 0.4 = 4/10 P(pval>0.01)= 1 0 0 0 0 0 0 0 0 0
81 1 0.00230769 = 3/1300 P(pval<0.01)= 0.99902 0.508847 0.495754 0.521941 0.24085 0.135517 0.346183 251.84 240.059 263.621
82 0 0.5 = 5/10 P(pval>0.01)= 1 0 0 0 0 0 0 0 0 0
83 0 0.8 = 8/10 P(pval>0.01)= 1 0 0 0 0 0 0 0 0 0
84 0 0.7 = 7/10 P(pval>0.01)= 1 0 0 0 0 0 0 0 0 0
85 0 0.2 = 4/20 P(pval>0.01)= 0.999998 0 0 0 0 0 0 0 0 0
86 0 0.3 = 3/10 P(pval>0.01)= 0.999997 0 0 0 0 0 0 0 0 0
87 0 0.2 = 2/10 P(pval>0.01)= 0.999845 0 0 0 0 0 0 0 0 0
88 0 0.5 = 5/10 P(pval>0.01)= 1 0 0 0 0 0 0 0 0 0
89 1 0 = 0/690 P(pval<0.01)= 0.99904 0.961137 0.885444 1.03683 6.06533 -670.853 682.983 245.924 234.977 256.871
90 0 0.166667 = 5/30 P(pval>0.01)= 0.999999 0 0 0 0 0 0 0 0 0
91 0 0.8 = 8/10 P(pval>0.01)= 1 0 0 0 0 0 0 0 0 0
92 0 0.075 = 3/40 P(pval>0.01)= 0.999246 0 0 0 0 0 0 0 0 0
93 0 0.0473684 = 9/190 P(pval>0.01)= 0.999973 0 0 0 0 0 0 0 0 0
94 0 0.7 = 7/10 P(pval>0.01)= 1 0 0 0 0 0 0 0 0 0
95 0 0.5 = 5/10 P(pval>0.01)= 1 0 0 0 0 0 0 0 0 0
96 0 0.6 = 6/10 P(pval>0.01)= 1 0 0 0 0 0 0 0 0 0
97 0 0.0333333 = 9/270 P(pval>0.01)= 0.999529 0 0 0 0 0 0 0 0 0
98 0 0.0571429 = 4/70 P(pval>0.01)= 0.999247 0 0 0 0 0 0 0 0 0
99 0 0.9 = 9/10 P(pval>0.01)= 1 0 0 0 0 0 0 0 0 0
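
The P(pval>0.01) and P(pval<0.01) columns above can be explored numerically. The sketch below is an assumption, not a transcription of the author's code: it treats the k out of N pseudo-experiments as a binomial observation and uses a flat prior on the true p-value, so that the posterior is Beta(k+1, N-k+1). Under that assumption it agrees with listed entries such as 3/30 (0.999746) and 2/10 (0.999845) to the printed precision; it is not guaranteed to match every row. The function names are hypothetical.

```python
import math

def prob_pvalue_above(k, n, threshold=0.01):
    """P(true p-value > threshold) given k of n pseudo-experiments with a
    statistic at least as large as observed, ASSUMING a flat prior, i.e. a
    Beta(k + 1, n - k + 1) posterior.

    For integer parameters the Beta CDF is a binomial tail:
    P(p < x) = P(Binomial(n + 1, x) >= k + 1),
    so P(p > x) is the sum of the binomial pmf for j = 0..k.
    """
    m = n + 1
    log_x = math.log(threshold)
    log_1mx = math.log1p(-threshold)
    total = 0.0
    for j in range(k + 1):
        # Binomial pmf evaluated in log space to avoid overflow for large n.
        log_pmf = (math.lgamma(m + 1) - math.lgamma(j + 1)
                   - math.lgamma(m - j + 1)
                   + j * log_x + (m - j) * log_1mx)
        total += math.exp(log_pmf)
    return total

def prob_pvalue_below(k, n, threshold=0.01):
    """Complement of prob_pvalue_above under the same assumed posterior."""
    return 1.0 - prob_pvalue_above(k, n, threshold)

# Compare with the rows "3/30" and "0/690" of the table.
print(prob_pvalue_above(3, 30))
print(prob_pvalue_below(0, 690))
```

A stopping rule of the kind suggested by the table (keep generating pseudo-experiments until one of the two probabilities exceeds 0.999) could be built on top of these two functions, but that rule is an inference from the listed numbers rather than something stated in this appendix.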

Figure 1: Fitting an exponential spectrum with a Gaussian signal, like in [3]. In one
case the whole spectrum is fitted, and in the other the algorithm described in
paragraph 2.1.1 locates the anomalous region and fits the rest of the spectrum.

[Figure 2, panels (a) to (e); x-axis labels of (c), (d), (e): BumpHunter statistic, Signal position, Signal position]

Figure 2: (a) The data of dataset 10, with the result of fitting eq. 21 as described in
paragraph 2.1.1. The bottom of the figure compares the data (D) to the background (B)
in each bin, using an approximation of the significance. The blue vertical lines show
the most discrepant bump found, namely the central window of the local hypothesis
test which yielded the smallest p-value. (b) The fit of eq. 22 to the data. (c) The
distribution of the BumpHunter statistic in 690 pseudo-experiments (t) generated to
follow the distribution obtained by the fit in (a). The observed BumpHunter statistic
(t_o) is marked by the blue arrow. (d) The 2-dimensional 0.5σ (red), 1σ (black), and
2σ (blue) confidence contours for the signal position and amount. The black marker
and the error bars correspond to the most likely values and the uncertainty returned
by TF1::GetParError. (e) Same as (d) but showing the signal position and slope
parameter A.

Figure 3: Same as Fig. 2 but for dataset 0, where most likely there is no signal. An
obvious difference is that only 10 pseudo-experiments are generated, 9 of which have a
bigger BumpHunter statistic than observed, as shown in (c).
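
The counting behind "9 of 10" in Fig. 3 can be sketched as follows. The ten statistic values and the observed value are made up for illustration (they are not the values of dataset 0), the function name is hypothetical, and the use of "greater than or equal" is an assumption about how ties would be treated.

```python
def count_as_extreme(observed, pseudo_statistics):
    """Return (k, n): how many of the n pseudo-experiment statistics are
    at least as large as the observed statistic."""
    k = sum(1 for t in pseudo_statistics if t >= observed)
    return k, len(pseudo_statistics)

# Hypothetical statistics from 10 pseudo-experiments (illustration only).
pseudo = [1.2, 3.4, 2.2, 5.1, 0.4, 2.9, 4.0, 1.9, 2.5, 3.1]
observed = 1.0

k, n = count_as_extreme(observed, pseudo)
p_value_estimate = k / n  # the p-value is estimated as the fraction k/n
print(k, n, p_value_estimate)
```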

Figure 4: Summary of the results from 5 Banff Challenge Problem 1 datasets, where
no discovery was claimed. The datasets are {100, 400, 500, 700, 800}, and one row of
figures corresponds to each, respectively. We see from the 2-dimensional contours that
parameters D and E are poorly constrained, because there is no significant signal in the
data to constrain them. The corresponding most likely p-values are: [illegible].

Figure 5: Summary of results from 5 Banff Challenge Problem 1 datasets, where a
discovery was claimed. The datasets are {22, 25, 35, 41, 42}. One row of figures
corresponds to each. In the 3rd row, 3rd column, the blue arrow is missing because
the observed BumpHunter statistic t_o = 24.9 is outside the plotted range. The same
happens in dataset 42, last row, with t_o = 14.9. The corresponding most likely p-values
are {11/2550, 7/1960, 0/690, 266/31080, 0/690}.

[Figure 6, panels (a) to (d); panels show A, D, E, and the joint distribution of D and E]

Figure 6: For the 1819 datasets where the p-value was estimated to be < 0.01, the
distributions of fitted parameters {A, E, D} are shown, as well as the joint distribution
of D and E in (d). To remove the effect of binning, A and D are shown after dividing the
fitted values by the bin size (0.025). We have the information that A actually follows a
Gaussian of mean = 10^4 and standard deviation = 1000, and we see that the A we obtain
with our procedure is distributed in a very similar way, even though this population of
datasets is the subset where signal exists, which makes the estimation of A more
challenging.

Figure 7: The expected spectrum in each of the three sensitivity tests listed in paragraph
4.1. In each case an example of pseudo-data is shown, with the blue vertical lines
bracketing the central window of the most discrepant bump found in each pseudo-
spectrum.

Figure 8: The sensitivity of various BumpHunter tunings and of the targeted test
of paragraph 4.2, as a function of the expected value of signal events. In the legend,
"narrow" implies bump width (W_c) constrained between 3 and 5 bins (see paragraph
[illegible]). Figures (a), (b) and (c) correspond to Gaussian signal injected with
σ = 0.03 and mean value 0.1, 0.5 and 0.9 respectively. The targeted likelihood ratio
test of paragraph 4.2 is shown in addition to four BumpHunter tunings, and the
[illegible] test.

[Figure 9, panels (a) to (f); axis labels: Interval number, Interval width]

Figure 9: This figure is made using pseudo-experiments where no signal is injected
on top of the background given by 10^4 e^{-10x}. The first row (Fig. 9(a), 9(b), 9(c)) shows
the results of BumpHunter tuning 1 (unconstrained window size), and the second
row (Fig. 9(d), 9(e), 9(f)) shows the results of tuning 2 (window between 3 and 5 bins,
which corresponds to intervals of length 0.075 to 0.125 in x). Fig. 9(a) shows the most
discrepant intervals found in 100 pseudo-experiments. The x-coordinate is an integer
that enumerates each interval, and the y-axis shows the x-range of each interval with
a black line for odd pseudo-experiments and a red line for even pseudo-experiments.
Fig. 9(b) shows the probability distribution of the width of the most discrepant interval
located in a pseudo-experiment. The probability distribution is estimated from a sample
of 1000 pseudo-experiments, and the uncertainties are binomial. Fig. 9(c) shows the
probability each bin of x has to be included in the most discrepant interval located
in a pseudo-experiment. This is not a probability distribution; the sum of its bins is
not equal to 1. It is best read bin by bin; for example the 1st bin,
x ∈ [0, 0.025], has probability ≈ 4.5% to be included in the most discrepant interval
the BumpHunter locates in a pseudo-experiment. These probabilities are estimated
using 1000 pseudo-experiments, and the uncertainties are binomial.
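
The per-bin quantity of Fig. 9(c) can be sketched as below, assuming each pseudo-experiment yields one most discrepant interval expressed as an inclusive range of bin indices, and assuming "binomial uncertainty" means the usual standard error sqrt(p(1-p)/n). The function name and the four toy intervals are hypothetical and only illustrate the bookkeeping.

```python
import math

def bin_inclusion_probability(intervals, n_bins):
    """For each bin, estimate the probability that it lies inside the most
    discrepant interval of a pseudo-experiment.

    intervals: one (first_bin, last_bin) pair per pseudo-experiment,
               with both ends inclusive.
    Returns (probabilities, binomial standard errors), one entry per bin.
    """
    n = len(intervals)
    counts = [0] * n_bins
    for first, last in intervals:
        for b in range(first, last + 1):
            counts[b] += 1
    probs = [c / n for c in counts]
    errors = [math.sqrt(p * (1.0 - p) / n) for p in probs]
    return probs, errors

# Toy input: 4 pseudo-experiments on an 8-bin spectrum (illustration only).
toy_intervals = [(0, 2), (1, 3), (5, 6), (1, 1)]
probs, errors = bin_inclusion_probability(toy_intervals, n_bins=8)
print(probs)
print(sum(probs))  # generally not 1, since intervals span several bins
```

Because an interval usually covers more than one bin, the per-bin values add up to the average interval width in bins rather than to 1, which is the point made in the caption.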

Figure 10: Same as Fig. 9(c), except that the pseudo-experiments are generated from
a different background ([expression illegible]), so as to have large event counts in all
x bins. The x-axis is labelled "Position".

Figure 11: Same as Fig. 9 except that just 1 signal event is expected, distributed as a
Gaussian with σ = 0.03 and mean 0.5.

Figure 12: Same as Fig. 11 except that 10 signal events are expected, distributed as a
Gaussian with σ = 0.03 and mean 0.5.

Figure 13: Same as Fig. 11 except that 40 signal events are expected, distributed as a
Gaussian with σ = 0.03 and mean 0.5.




[Figure 14, panels (a) and (b); x-axis labels: x, TailHunter statistic]

Figure 14: Fig. (a) shows an example of a spectrum where the TailHunter locates a
significant high-x tail. The spectrum has been constructed to indeed contain signal
uniformly distributed between 0 and 1 (see paragraph [illegible]). Fig. (b) shows the
distribution of the TailHunter statistic under H_0. The observed TailHunter statistic is 17.8,
unmatched by any of the 690 pseudo-experiments generated, implying a p-value less
than 0.01 with probability that exceeds 0.999.